CN118282471A - Method for allocating bandwidth resources of a satellite communication network and related equipment

Publication number: CN118282471A
Application number: CN202410182515.1A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Inventors: 欧清海, 姜燕, 蒋月
Applicant/Assignee: Beijing Zhongdian Feihua Communication Co., Ltd.
Prior art keywords: training, value, state, storage area, samples
Classification: Data Exchanges in Wide-Area Networks
Abstract

The application provides a method for allocating bandwidth resources of a satellite communication network and related equipment. By distinguishing the importance of the samples stored in different storage areas, the training efficiency and accuracy of a deep reinforcement learning model are improved. Because training experience samples corresponding to the current network environment state are added to the combined target training samples, the strategy learned by the deep reinforcement learning model is prevented from becoming inapplicable to the current network environment, which further improves the accuracy of the model. Satellite communication network bandwidth resources are then allocated, according to the target network bandwidth occupation value, to the satellite network node corresponding to the maximum state action value. This solves the problem that fixed bandwidth allocation cannot adapt to the dynamic changes of a satellite network and the flexible requirements of different communication services, and guarantees the full utilization and performance of the bandwidth resources.

Description

Method for allocating bandwidth resources of a satellite communication network and related equipment
Technical Field
The present application relates to the field of communications technologies, and in particular, to a method and related device for allocating bandwidth resources of a satellite communications network.
Background
In the context of power system operation, satellite communication can provide wide-area coverage, enabling remote power plants and distributed energy facilities to access the power system without obstruction and to transmit critical information, such as energy production data, in real time. Because the bandwidth resources of the satellite-terrestrial converged network are limited while diversified communication service requirements must be met, the allocation of bandwidth resources has to be more refined and stricter. For such a network, a better bandwidth resource allocation scheme can significantly improve the efficiency and utilization of the communication system, optimize network performance, and improve the service experience.
However, the conventional network bandwidth resource allocation strategy uses fixed bandwidth allocation, cannot adapt to the dynamic changes of a satellite network or the flexible requirements of different communication services, and easily leads to wasted bandwidth resources and poor performance.
Disclosure of Invention
Accordingly, an objective of the present application is to provide a method and related equipment for allocating bandwidth resources of a satellite communication network, which are used for solving or partially solving the above-mentioned technical problems.
Based on the above object, a first aspect of the present application provides a method for allocating bandwidth resources of a satellite communication network, including:
Determining a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
selecting training experience samples from the training experience samples in the first storage area according to the first priority, and selecting training experience samples from the training experience samples in the second storage area according to the second priority;
Acquiring a training experience sample corresponding to the current network environment state, and combining the training experience sample selected in the first storage area, the training experience sample selected in the second storage area and the training experience sample corresponding to the current network environment state to obtain a combined target training sample;
Training a pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
acquiring real-time network environment state information, and respectively determining state action values obtained by selecting each star network node to execute communication service by using the trained deep reinforcement learning model based on the real-time network environment state information;
Determining the maximum state action value in the state action values corresponding to all the star network nodes, acquiring a target network bandwidth occupation value when the star network node corresponding to the maximum state action value executes communication service, and distributing satellite communication network bandwidth resources to the star network node corresponding to the maximum state action value according to the target network bandwidth occupation value.
Optionally, before determining the first priority of each training experience sample stored in the preset first storage area and the second priority of each training experience sample stored in the preset second storage area, the method further includes:
acquiring the current transmission demand state of any one of a plurality of user terminals;
Selecting a target star network node from a plurality of star network nodes to execute communication service based on the current transmission demand state, and acquiring a network bandwidth limit value of a transmission channel between any user terminal and the target star network node, a network bandwidth occupation value when executing the communication service and a next transmission demand state of the current transmission demand state;
Determining a prize value using the network bandwidth limit and the network bandwidth occupancy value;
Storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state to the first storage area or a preset third storage area according to the reward value;
Training a pre-constructed long short-term memory (LSTM) network model by using the training experience samples stored in the first storage area to obtain a trained long short-term memory network model;
and inputting the training experience samples stored in the third storage area into the trained long short-term memory network model for prediction to obtain prediction experience samples, and storing the prediction experience samples into the second storage area.
Optionally, the determining the prize value using the network bandwidth limit and the network bandwidth occupancy value includes:
Acquiring the number of the star network nodes;
Performing product processing by using the number of the star network nodes and the network bandwidth limit value to obtain a first product processing result;
and carrying out ratio processing on the network bandwidth occupation value and the first product processing result to obtain the rewarding value.
Optionally, the storing the training experience sample formed by combining the current transmission requirement state, the target star network node, the reward value and the next transmission requirement state according to the reward value in the first storage area or a preset third storage area includes:
Judging whether the rewarding value is larger than or equal to a preset rewarding value threshold value or not, and obtaining a judging result;
In response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target satellite network node, the reward value and the next transmission demand state into the first storage area; or
in response to the judgment result being no, storing the training experience sample formed by combining the current transmission demand state, the target satellite network node, the reward value and the next transmission demand state into the third storage area.
Optionally, the determining the first priority of each training experience sample stored in the preset first storage area includes:
determining a first cosine similarity between the selected training empirical samples in the first storage area and the unselected training empirical samples in the first storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result;
Performing ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result;
And carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
Optionally, the determining the second priority of each training experience sample stored in the preset second storage area includes:
Determining a second cosine similarity between the selected training empirical samples in the second storage area and the unselected training empirical samples in the second storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result;
performing ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result;
And carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
Optionally, training the pre-constructed deep reinforcement learning model by using the combined target training sample to obtain a trained deep reinforcement learning model, including:
acquiring a real state action value corresponding to the combined target training sample;
inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting a predicted state action value through the pre-built deep reinforcement learning model;
And constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
Optionally, the real-time network environment status information includes: current star network state parameters;
Based on the real-time network environment state information, the state action value obtained by selecting each star network node to execute the communication service is respectively determined by utilizing the trained deep reinforcement learning model, and the method comprises the following steps:
Acquiring an instant rewarding value and a discount factor obtained by executing any star network node under the current star network state parameter, and selecting a current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter;
performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result;
And adding the second product processing result and the instant rewarding value to obtain a state action value obtained by selecting each star network node to execute communication service.
Based on the same inventive concept, a second aspect of the present application provides an allocation apparatus of bandwidth resources of a satellite communication network, including:
A priority determining module configured to determine a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
A sample selection module configured to select training experience samples from among the training experience samples of the first storage area according to the first priority, and to select training experience samples from among the training experience samples of the second storage area according to the second priority;
The sample combination module is configured to acquire training experience samples corresponding to the current network environment state, and combine the training experience samples selected in the first storage area, the training experience samples selected in the second storage area and the training experience samples corresponding to the current network environment state to acquire a combined target training sample;
The training module is configured to train the pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
The value determining module is configured to acquire real-time network environment state information, and based on the real-time network environment state information, respectively determine state action values obtained by selecting each star network node to execute communication service by utilizing the trained deep reinforcement learning model;
The resource allocation module is configured to determine the largest state action value in the state action values corresponding to the star network nodes, acquire a target network bandwidth occupation value when the star network node corresponding to the largest state action value executes communication service, and allocate satellite communication network bandwidth resources to the star network node corresponding to the largest state action value according to the target network bandwidth occupation value.
A third aspect of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
As can be seen from the above, in the method and related equipment for allocating bandwidth resources of a satellite communication network provided by the application, training experience samples are selected from the first storage area according to the first priority and from the second storage area according to the second priority, which fully considers and distinguishes the importance of the samples in different storage areas, thereby improving the training efficiency and accuracy of the deep reinforcement learning model; the selected samples are then combined to form the combined target training sample that serves as the training input of the deep reinforcement learning model.
Drawings
In order to more clearly illustrate the technical solutions of the present application or related art, the drawings that are required to be used in the description of the embodiments or related art will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow chart of a method for allocating bandwidth resources of a satellite communication network according to an embodiment of the present application;
fig. 2 is a schematic diagram of a flow chart for allocating bandwidth resources of a satellite communication network according to an embodiment of the present application;
FIG. 3 is a block diagram illustrating a configuration of an apparatus for allocating bandwidth resources of a satellite communication network according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an electronic device according to an embodiment of the application.
Detailed Description
The present application will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present application more apparent.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application shall have the ordinary meaning understood by a person of ordinary skill in the art to which the present application belongs. The terms "first", "second" and the like used in the embodiments of the present application do not denote any order, quantity or importance, but are merely used to distinguish different components. The word "comprising", "comprises" or the like means that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected", "coupled" and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
It will be appreciated that before using the technical solutions of the embodiments of the present application, the user is informed of the type, the range of use, the use scenario, etc. of the related personal information in an appropriate manner, and the authorization of the user is obtained.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly prompt the user that the operation it is requesting to perform will require personal information to be obtained and used with the user. Therefore, the user can select whether to provide personal information to the software or hardware such as the electronic equipment, the application program, the server or the storage medium for executing the operation of the technical scheme according to the prompt information.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization acquisition process is merely illustrative, and not limiting of the implementation of the present application, and that other ways of satisfying relevant legal regulations may be applied to the implementation of the present application.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Bandwidth allocation in a satellite-terrestrial converged network refers to the process of efficiently allocating and managing bandwidth to meet the demands of communication services. In the context of power system operation, satellite communication can provide wide-area coverage, enabling remote power plants and distributed energy facilities to access the power system without obstruction and to transmit critical information, such as energy production data, in real time. For the satellite-terrestrial converged network, a better network bandwidth resource allocation scheme can significantly improve the efficiency and utilization of the communication system, optimize network performance, and improve the service experience.
Because the bandwidth resources of the satellite-terrestrial converged network are limited while diversified communication service requirements must be met, the allocation of bandwidth resources has to be more refined and stricter.
However, the conventional network bandwidth resource allocation strategy uses fixed bandwidth allocation, cannot adapt to the dynamic changes of a satellite network or the flexible requirements of different communication services, and easily leads to wasted bandwidth resources and poor performance.
In the method for allocating bandwidth resources of a satellite communication network provided by the embodiments of the application, training experience samples are selected from the first storage area according to the first priority and from the second storage area according to the second priority. The importance of samples in different storage areas is thus fully considered and distinguished, which improves the training efficiency and accuracy of the deep reinforcement learning model. The selected samples are combined into a combined target training sample that serves as the training input of the deep reinforcement learning model; because training experience samples corresponding to the current network environment state are added to the combined target training sample, the strategy learned by the deep reinforcement learning model is prevented from becoming inapplicable to the current network environment, the accuracy of the model is improved, and the state action values it determines are more accurate. Satellite communication network bandwidth resources are then allocated, according to the target network bandwidth occupation value, to the satellite network node corresponding to the largest state action value. This avoids the problem that fixed bandwidth allocation cannot adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services, and ensures that the bandwidth resources are fully utilized.
As shown in fig. 1, the method of the present embodiment includes:
Step 101, determining a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services.
In this step, the conventional approach employs uniform random sampling to select the empirical samples for training the deep reinforcement learning model, enabling the deep reinforcement learning model to converge faster, but without taking into account the difference in useful information provided by the different empirical samples.
The method fully considers the importance of the samples in different storage areas, distinguishes the importance of the samples in different storage areas, and improves the training efficiency and the training precision of the deep reinforcement learning model by determining the first priority of each training experience sample stored in a preset first storage area and the second priority of each training experience sample stored in a preset second storage area.
Step 102, selecting training experience samples from the training experience samples in the first storage area according to the first priority, and selecting training experience samples from the training experience samples in the second storage area according to the second priority.
In this step, training empirical samples are selected from the first storage area based on the first priority, and training empirical samples are selected from the second storage area using the second priority such that the selected empirical samples contain as much useful information as possible, thereby reducing the number of states that the deep reinforcement learning model must explore or utilize, helping the deep reinforcement learning model to converge quickly.
And 103, acquiring training experience samples corresponding to the current network environment state, and combining the training experience samples selected in the first storage area, the training experience samples selected in the second storage area and the training experience samples corresponding to the current network environment state to obtain a combined target training sample.
In the step, because the training experience sample corresponding to the current network environment state is added in the combined target training sample, the combined target training sample is used as the input of the deep reinforcement learning model training, so that the condition that the deep reinforcement learning model learning strategy is not suitable for the current network environment can be avoided, and the precision of the deep reinforcement learning model can be improved.
And 104, training a pre-constructed deep reinforcement learning model by using the combined target training sample to obtain a trained deep reinforcement learning model.
In this step, the trained deep reinforcement learning model may be a Deep Q-Network (DQN) model, a Double Deep Q-Network (DDQN) model, or a Deep Deterministic Policy Gradient (DDPG) model; the DDQN model is preferred.
The deep reinforcement learning model combines a neural network with a reinforcement learning algorithm to solve decision problems with delayed rewards, so that the model can learn autonomously from the environment and make decisions according to the learned experience.
During training, deep reinforcement learning improves itself by interacting with the environment: it observes the state of the environment, selects and performs appropriate actions, and then receives a reward signal from the environment as feedback. The goal is to learn a strategy that maximizes the long-term cumulative reward. For adaptive network resource allocation, the network bandwidth resource allocation problem is modeled as a reinforcement learning problem, and an optimal allocation strategy can be learned by interacting with the environment. Deep reinforcement learning can model complex environments and tasks and learn an automatically optimized strategy, which makes it very suitable for solving the adaptive bandwidth resource allocation problem in a satellite-terrestrial converged network.
In addition, the importance of the samples in different storage areas is fully considered by the combined target training sample, so that the training efficiency and the training precision of the deep reinforcement learning model are improved, the combined target training sample is formed in a combined mode to serve as the input of the deep reinforcement learning model training, and the training experience sample corresponding to the current network environment state is added into the combined target training sample, so that the situation that the deep reinforcement learning model learning strategy is not suitable for the current network environment can be avoided, and the precision of the deep reinforcement learning model can be improved.
Step 105, acquiring real-time network environment state information, and based on the real-time network environment state information, respectively determining state action values obtained by selecting each star network node to execute communication service by using the trained deep reinforcement learning model.
In the step, the trained deep reinforcement learning model can be suitable for real-time network environments at different moments, so that the state action value obtained by selecting each star network node to execute communication service can be more accurate by utilizing the trained deep reinforcement learning model.
Step 106, determining the largest state action value in the state action values corresponding to all the star network nodes, obtaining a target network bandwidth occupation value when the star network node corresponding to the largest state action value executes communication service, and distributing satellite communication network bandwidth resources to the star network node corresponding to the largest state action value according to the target network bandwidth occupation value.
In this step, the state action values determined by the deep reinforcement learning model are more accurate, and satellite communication network bandwidth resources are allocated to the satellite network node corresponding to the maximum state action value according to the target network bandwidth occupation value, so as to adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services. In addition, the full utilization and performance of the bandwidth resources can be guaranteed, which solves the problem that conventional fixed bandwidth allocation cannot adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services.
In the above scheme, training experience samples are selected from the first storage area according to the first priority and from the second storage area according to the second priority, so that the importance of samples in different storage areas is fully considered and distinguished, improving the training efficiency and accuracy of the deep reinforcement learning model. The selected samples are combined into a combined target training sample that serves as the training input of the deep reinforcement learning model; because training experience samples corresponding to the current network environment state are added to the combined target training sample, the strategy learned by the deep reinforcement learning model is prevented from becoming inapplicable to the current network environment, the accuracy of the model is improved, and the state action values it determines are more accurate. Satellite communication network bandwidth resources are further allocated, according to the target network bandwidth occupation value, to the satellite network node corresponding to the largest state action value, which avoids the problem that fixed bandwidth allocation cannot adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services, and guarantees the full utilization and performance of the bandwidth resources.
In some embodiments, prior to step 101, the method further comprises:
Step A1, a current transmission demand state of any one of a plurality of user terminals is obtained.
And step A2, selecting a target star network node from a plurality of star network nodes to execute communication service based on the current transmission demand state, and acquiring a network bandwidth limit value of a transmission channel between any user terminal and the target star network node, a network bandwidth occupation value when executing the communication service and a next transmission demand state of the current transmission demand state.
And A3, determining a reward value by utilizing the network bandwidth limit value and the network bandwidth occupation value.
And step A4, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area or a preset third storage area according to the reward value.
And step A5, training the pre-constructed long-short-period memory network model by using the training experience sample stored in the first storage area to obtain a trained long-short-period memory network model.
And step A6, inputting the training experience sample stored in the third storage area into a trained long-term and short-term memory network model for prediction to obtain a prediction experience sample, and storing the prediction experience sample into the second storage area.
In this scheme, the satellite-terrestrial converged network is modeled to obtain a set N of user terminals, a set M of satellite network nodes and an action space A_n. In each time step, the current transmission demand state s is acquired from the network environment, and an action a is selected based on this state, representing the selection of a certain satellite network node to carry the communication service. This action is performed in the satellite-terrestrial converged network, which then returns the next transmission demand state s' and the reward value of this decision. At this point, whether the action selection was successful is judged according to the reward value; if so, the quadruple (s, a, r, s') is placed in the successful playback buffer (i.e., the first storage area), and if not, it is placed in the failed playback buffer (i.e., the third storage area).
Specifically, the satellite-terrestrial converged network under consideration is modeled as follows. The set of user terminals is N = {1, 2, ..., N}, and the set of satellite network nodes is M = {1, 2, ..., M}. The action space of any user n is A_n = {0, 1, 2, ..., M}, that is, the user can select any node in the satellite network node set M for task data transmission. The state space s is represented by the current transmission demand state of each user, where 0 indicates that the user currently has no transmission demand and 1 indicates that it does. Each time user n transmits task data to a satellite network node m, a certain bandwidth is occupied, denoted b_nm, and the maximum bandwidth limit (i.e., the network bandwidth limit) of each transmission channel is b_max. Each user terminal corresponds to at most one transmission task at a time.
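To make this modeling concrete, the following minimal sketch represents the user set N, the satellite node set M, the per-channel bandwidth limit b_max, the binary transmission-demand state and the per-user action space. All names and sizes (NUM_USERS, NUM_NODES, B_MAX, the random bandwidth matrix) are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

# Illustrative sketch of the modeling above; sizes and values are assumed.
NUM_USERS = 8      # |N|: user terminals n = 1..N
NUM_NODES = 4      # |M|: satellite network nodes m = 1..M
B_MAX = 10.0       # maximum bandwidth of each transmission channel (b_max)

rng = np.random.default_rng(0)

# State s: one bit per user terminal (1 = has a transmission demand, 0 = no demand).
state = rng.integers(0, 2, size=NUM_USERS)

# Action space of user n: A_n = {0, 1, ..., M}; 0 means no transmission.
action_space = np.arange(NUM_NODES + 1)

# b[n, m]: bandwidth occupied when user n transmits task data to node m, capped at b_max.
b = np.minimum(rng.uniform(0.5, 1.2 * B_MAX, size=(NUM_USERS, NUM_NODES)), B_MAX)
```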
After a user terminal selects a satellite network node to execute the communication service at time t, a general reward (i.e., a reward value) r_t(s_t, a_t) for the satellite network is obtained.
And determining whether the action of the current round is successful or not according to the universal rewards of the bandwidth allocation decision of each round of the user. Successful experiences are stored in corresponding successful experience buffers, and failed experiences are temporarily stored in failed experience buffers.
When a user has accumulated a certain amount of successful experience in executing bandwidth resource allocation decisions for the satellite-terrestrial converged network, the LSTM network (i.e., the long short-term memory network model) is trained using the successful experience (i.e., the training experience samples stored in the first storage area). The LSTM network is then used to process the user's failed experience (i.e., the training experience samples stored in the third storage area), and the processed experience (i.e., the predicted experience samples) is stored in the predicted experience buffer (i.e., the second storage area), so as to extract potential patterns and make maximum use of the user experience data.
Specifically, successful experience is utilized to train the LSTM network, so that the LSTM network can learn and predict bandwidth allocation decision data in a star-ground fusion network environment.
The failed experience is passed to the network for prediction. The result of the prediction is stored in a predicted playback buffer for subsequent sampling.
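A minimal PyTorch sketch of this success/failure workflow is given below, assuming the LSTM maps one flattened (s, a, r, s') tuple to one predicted tuple. The layer sizes, the reconstruction-style training objective and the flattening of tuples into fixed-length vectors are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn

class ExperienceLSTM(nn.Module):
    """Maps a flattened experience tuple to a predicted experience tuple."""
    def __init__(self, tuple_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(tuple_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, tuple_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # x: (batch, seq_len, tuple_dim)
        return self.head(out[:, -1])   # predicted experience tuple

def train_on_successes(model: ExperienceLSTM, success_batch: torch.Tensor, epochs: int = 10):
    """Fit the LSTM on successful experiences (assumed reconstruction objective)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        pred = model(success_batch.unsqueeze(1))   # each tuple treated as a length-1 sequence
        loss = nn.functional.mse_loss(pred, success_batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

def predict_from_failures(model: ExperienceLSTM, failed_batch: torch.Tensor) -> torch.Tensor:
    """Turn failed experiences into predicted experiences for the predicted-experience buffer."""
    with torch.no_grad():
        return model(failed_batch.unsqueeze(1))
```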
When a certain amount of effective experience is accumulated, samples are uniformly and randomly extracted from different experience pools according to preset weights. The sampled experience is input into DDQN networks (i.e., deep reinforcement learning models) to calculate Q values (i.e., state action values) and loss is calculated to update the networks until the main network converges. And optimizing the network bandwidth resource allocation decision according to the obtained continuously optimized DDQN target network so as to realize the high-efficiency self-adaptive allocation of the star-ground fusion network bandwidth and improve the overall performance of the star network.
Specifically, experience samples from different sources are mixed as the input of the neural network (i.e., the deep reinforcement learning model), finally yielding a mixed experience set E (i.e., the combined target training sample). The sample sources include samples E_0 from the successful experience pool (i.e., the training experience samples selected in the first storage area), samples E_1 from the predicted experience pool (i.e., the training experience samples selected in the second storage area), and the most recent samples E_2 from the current satellite network environment (i.e., the training experience samples corresponding to the current network environment state). The purpose of adding the samples generated by the latest interaction with the environment is to include, for learning, the samples closest to the current state of the satellite network, thereby preventing the learned strategy from being inapplicable to the current network environment.
The DDQN network is trained using the data set obtained after mixed experience sampling (i.e., the combined target training samples) to obtain the bandwidth resource allocation decision for the satellite-terrestrial converged network.
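One way to assemble the mixed experience set E from the three sources is sketched below. The 300/150/50 split and the list-based interface are illustrative assumptions; the patent only says the pools are sampled according to preset weights.

```python
def build_mixed_batch(success_pool, predicted_pool, recent_pool,
                      n_success=300, n_predicted=150, n_recent=50):
    """Combine E0 (successful experiences), E1 (predicted experiences) and E2 (the
    latest interactions with the satellite network) into one DDQN training batch.
    The first two pools are assumed to be ordered by the priority rule described below."""
    e0 = list(success_pool)[:n_success]
    e1 = list(predicted_pool)[:n_predicted]
    e2 = list(recent_pool)[-n_recent:]
    return e0 + e1 + e2
```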
In some embodiments, step A3 comprises:
And step A31, acquiring the number of the satellite network nodes.
And step A32, performing product processing using the number of the satellite network nodes and the network bandwidth limit value to obtain a first product processing result.
And step A33, performing ratio processing on the network bandwidth occupation value and the first product processing result to obtain the reward value.
In the above scheme, the number of satellite network nodes M and the maximum bandwidth of each transmission channel (i.e., the network bandwidth limit) b_max are multiplied to obtain the first product result M * b_max.
The bandwidth b_{n,a_n} occupied when user terminal n transmits task data to satellite network node a_n is then divided by the first product result M * b_max to obtain the reward value r_t(s_t, a_t).
This can be expressed as follows:

r_t(s_t, a_t) = b_{n,a_n} / (M * b_max)    (1)

Jointly considering the number of satellite network nodes and the network bandwidth limit value makes the determined reward value more accurate.
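Under the definitions above, the reward reduces to a single ratio; the sketch below uses illustrative function and parameter names.

```python
def reward_value(b_occupied: float, num_nodes: int, b_max: float) -> float:
    """r_t(s_t, a_t) = b_{n,a_n} / (M * b_max): the bandwidth occupied by the chosen
    transmission, normalised by the total bandwidth budget of all M channels."""
    return b_occupied / (num_nodes * b_max)

# Example: 6.0 units occupied, M = 4 satellite nodes, b_max = 10.0  ->  reward 0.15
print(reward_value(6.0, 4, 10.0))
```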
In some embodiments, step A4 comprises:
and step A41, judging whether the rewarding value is larger than or equal to a preset rewarding value threshold value, and obtaining a judging result.
And step A42, in response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the rewards value and the next transmission demand state into the first storage area. Or alternatively
And step A43, in response to the judging result being no, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the rewarding value and the next transmission demand state in the third storage area.
In the above scheme, when the reward value is greater than or equal to the preset reward value threshold, indicating that the action selection is successful, storing a training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area.
And when the rewarding value is smaller than a preset rewarding threshold value, indicating that the action selection is not effective, and storing a training experience sample formed by combining the current transmission demand state, the target star network node, the rewarding value and the next transmission demand state into a third storage area.
For example, the prize value threshold is 0.5, when the prize value is greater than or equal to 0.5, the quadruple (s, a, r, s') is stored in the successful playback buffer (i.e., the first storage region), and the quadruple with the prize value less than 0.5 is stored in the failed playback buffer (i.e., the third storage region).
The quadruple is a training experience sample formed by combining the current transmission demand state, the target star network node, the rewarding value and the next transmission demand state.
Distinguishing the importance of the samples according to the reward value in this way improves the sample utilization rate and the training efficiency.
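A small sketch of this routing rule follows; the buffer capacities are assumed, and the 0.5 threshold follows the example above.

```python
from collections import deque

success_buffer = deque(maxlen=10_000)   # first storage area (successful experiences)
failure_buffer = deque(maxlen=10_000)   # third storage area (failed experiences)
REWARD_THRESHOLD = 0.5                  # preset reward value threshold from the example

def store_experience(s, a, r, s_next):
    """Route the quadruple (s, a, r, s') by comparing its reward to the threshold."""
    if r >= REWARD_THRESHOLD:
        success_buffer.append((s, a, r, s_next))
    else:
        failure_buffer.append((s, a, r, s_next))
```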
In some embodiments, in step 101, the determining the first priority of each training experience sample stored in the preset first storage area includes:
and B1, determining a first cosine similarity between the selected training experience samples in the first storage area and the unselected training experience samples in the first storage area.
And B2, acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is the difference between the reward value corresponding to the current transmission demand state of any one of the plurality of user terminals and the reward value corresponding to the next transmission demand state of the current transmission demand state.
And B3, carrying out difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result.
And B4, carrying out ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result.
And step B5, carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
In the above scheme, conventional DDQN employs uniform random sampling to select experience samples to train the network, enabling faster convergence of the network, but does not take into account the difference in useful information provided by the different experience samples in the experience playback pool.
The application provides a priority sampling mechanism based on cosine similarity and the time difference error, so that the extracted experience samples contain as much useful information as possible, thereby reducing the number of states the network has to explore or exploit and helping the network converge quickly. The first cosine similarity between the selected training experience samples in the first storage area and an unselected training experience sample E_t in the first storage area is defined as:

D_t = sum_{i=1}^{n} (S_i . E_t) / (||S_i|| * ||E_t||)    (2)

Further, the first priority of each experience sample is defined as:

p_t = |delta_t| * (n - D_t) / n    (3)

where S_i is the i-th experience tuple among the extracted experience samples (i.e., the selected training experience samples in the first storage area), n is the total number of extracted experience samples (i.e., the number of selected training experience samples), E_t is the t-th experience sample among the samples to be extracted (i.e., the unselected training experience samples in the first storage area), and delta_t is the TD-error (time difference error) of the experience, i.e., the difference between the reward of the current state and the expected reward of the next state. Sampling is performed sequentially and independently: the sample with the highest priority is drawn each time, the priorities of the remaining samples are updated according to the current sampling result, and this is repeated until enough samples have been extracted for the DDQN network update.
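The sequential priority sampling just described can be sketched as follows. Flattening experience tuples into vectors and seeding the first draw with the largest |TD error| are assumptions, since the patent does not fix these details.

```python
import numpy as np

def priority(selected: np.ndarray, candidate: np.ndarray, td_error: float) -> float:
    """p_t = |delta_t| * (n - D_t) / n, where D_t sums the cosine similarities between
    the candidate E_t and the n already-selected experience vectors S_1..S_n."""
    selected = np.atleast_2d(selected)
    n = selected.shape[0]
    cos = (selected @ candidate) / (
        np.linalg.norm(selected, axis=1) * np.linalg.norm(candidate) + 1e-8)
    return abs(td_error) * (n - cos.sum()) / n

def sample_by_priority(pool, td_errors, k: int):
    """Draw k experiences one at a time, always taking the currently highest-priority
    sample and re-scoring the remaining pool after every draw."""
    vectors = [np.asarray(e, dtype=float) for e in pool]
    first = int(np.argmax(np.abs(td_errors)))          # assumed seeding rule
    chosen = [first]
    remaining = [i for i in range(len(vectors)) if i != first]
    while len(chosen) < k and remaining:
        sel = np.stack([vectors[i] for i in chosen])
        scores = [priority(sel, vectors[i], td_errors[i]) for i in remaining]
        chosen.append(remaining.pop(int(np.argmax(scores))))
    return chosen                                       # indices into the pool
```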
In some embodiments, in step 101, the determining the second priority of each training experience sample stored in the preset second storage area includes:
And C1, determining second cosine similarity between the selected training experience samples in the second storage area and the unselected training experience samples in the second storage area.
And C2, acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to the current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to the next transmission demand state of the current transmission demand state.
And C3, carrying out difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result.
And C4, carrying out ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result.
And step C5, carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
In the above scheme, conventional DDQN employs uniform random sampling to select experience samples to train the network, enabling faster convergence of the network, but does not take into account the difference in useful information provided by the different experience samples in the experience playback pool.
The application provides a priority sampling mechanism based on cosine similarity and the time difference error, so that the extracted experience samples contain as much useful information as possible, thereby reducing the number of states the network has to explore or exploit and helping the network converge quickly. The second cosine similarity between the selected training experience samples in the second storage area and an unselected training experience sample E_t in the second storage area is defined as:

D_t = sum_{i=1}^{n} (S_i . E_t) / (||S_i|| * ||E_t||)

Further, the second priority of each experience sample is defined as:

p_t = |delta_t| * (n - D_t) / n

where S_i is the i-th experience tuple among the extracted experience samples (i.e., the selected training experience samples in the second storage area), n is the total number of extracted experience samples (i.e., the number of selected training experience samples), E_t is the t-th experience sample among the samples to be extracted (i.e., the unselected training experience samples in the second storage area), and delta_t is the TD-error (time difference error) of the experience, i.e., the difference between the reward of the current state and the expected reward of the next state. Sampling is performed sequentially and independently: the sample with the highest priority is drawn each time, the priorities of the remaining samples are updated according to the current sampling result, and this is repeated until enough samples have been extracted for the DDQN network update.
In some embodiments, step 104 comprises:
step 1041, obtaining a real state action value corresponding to the combined target training sample.
Step 1042, inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting the predicted state action value via the pre-built deep reinforcement learning model.
Step 1043, constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
In the scheme, the deep reinforcement learning model updates the weight through a gradient descent method, and reduces the loss between the target Q value (namely the real state action value) and the predicted Q value (namely the predicted state action value) so as to realize the optimal bandwidth allocation decision as much as possible.
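A sketch of one such gradient-descent step in PyTorch is shown below; the network and optimizer interfaces and the use of a mean-squared-error loss are assumptions, with the target Q values assumed to be precomputed as described later.

```python
import torch
import torch.nn.functional as F

def ddqn_step(online_net, optimizer, states, actions, y_target):
    """One update: minimise the loss between the predicted Q(s, a) of the online
    network and the (precomputed) target Q values y_target.
    `actions` is a LongTensor of chosen node indices, shape (batch,)."""
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_pred, y_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```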
Because the DDQN model has learned the environment to different degrees in the early and late stages of the algorithm, the corresponding exploration and exploitation probabilities should also differ: when little environment information has been acquired in the early stage, a larger exploration probability should be used to gather information, while a smaller exploration probability should be used in the later stage to exploit the optimal strategy. In order to balance exploration and exploitation, the application employs an epsilon adaptive adjustment mechanism that uses the reward values obtained from the environment to decide whether to attenuate the exploration rate. Only when a certain reward threshold is crossed, indicating that enough information has been learned from the environment to support better decisions, is the exploration probability reduced and the exploitation probability increased: the value of epsilon is decreased while the reward threshold is increased. In addition, after each state transition, the exploration probability epsilon(s) of each state is recalculated according to the Boltzmann distribution of the value differences.
Here σ is a positive constant that determines the influence of the selected action on the exploration probability of the related state, and δ is the reciprocal of the number of actions of state s, i.e., δ = 1/|A(s)|; Threshold is the reward threshold, and λ_increment is the reward-threshold growth factor, taken as a constant greater than 1.
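A rough sketch of the threshold-gated part of this mechanism follows. The multiplicative epsilon decay factor is an assumption, and the exact per-state epsilon(s) update based on the Boltzmann distribution of value differences is given by the patent's formulas rather than reproduced here.

```python
def adapt_exploration(epsilon_s: float, reward: float, threshold: float,
                      lam_increment: float = 1.1, epsilon_decay: float = 0.9):
    """If the received reward crosses the current reward threshold, reduce the
    exploration probability of this state and raise the threshold; otherwise
    leave both unchanged. epsilon_decay is an assumed constant."""
    if reward >= threshold:
        epsilon_s *= epsilon_decay
        threshold *= lam_increment
    return epsilon_s, threshold
```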
Further, the ε-greedy strategy is expressed as follows:

a_t = a random action from A(s),   if p_e < ε(s)
a_t = argmax_a Q(s_t, a; θ_t),     otherwise

where p_e is a random number between 0 and 1, and ε(s) (0 < ε(s) < 1) is the exploration probability calculated by the algorithm. The strategy randomly selects one action from the action space A(s) with probability ε(s) in order to avoid becoming trapped in a local optimum.
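A per-state ε-greedy selection can be sketched as follows; the flat Q-value vector interface and the NumPy random generator are assumptions.

```python
import numpy as np

def epsilon_greedy_action(q_values: np.ndarray, epsilon_s: float,
                          rng: np.random.Generator) -> int:
    """With probability epsilon(s) pick a random node from A(s); otherwise pick the
    node with the largest Q value. q_values[a] is Q(s, a) over the action space."""
    if rng.random() < epsilon_s:                 # explore
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))              # exploit
```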
By obtaining an adaptive bandwidth allocation decision scheme for the satellite-terrestrial converged network, bandwidth allocation decisions can be made in real time according to the current network state and the predicted bandwidth demand. Efficient network bandwidth resource allocation is thus achieved for the satellite-terrestrial converged network, the personalized requirements of various services are met, and the performance of the network is improved.
In some embodiments, the real-time network environment status information includes: current star network state parameters.
In step 105, the determining, based on the real-time network environment status information, status action values obtained by selecting each star network node to execute the communication service by using the trained deep reinforcement learning model includes:
Step 1051, obtaining the instant prize value and discount factor obtained by executing any star network node under the current star network state parameter, and selecting the current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter.
And step 1052, performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result.
And 1053, adding the second product processing result and the instant rewards value to obtain a state action value obtained by selecting each star network node to execute the communication service.
In the above scheme, the deep reinforcement learning model reduces overestimation by decomposing the max operation in the target into action selection and action evaluation. Meanwhile, the calculation of the target Q value in the deep Q-network is further improved, so that the Q value (i.e., the state action value) is more accurate.
The target Q value (i.e., the state action value) is finally expressed as:

Y_t = r_{t+1} + γ * Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ_t^-)    (4)

where Y_t is the target Q value; r_{t+1} is the instant reward value designed for the satellite-terrestrial converged network, representing the instant reward obtained after performing the satellite network node selection action a_t under the current satellite network state parameter s_t; and γ is the discount factor that weighs the importance of future reward values. θ_t denotes the current Q-network parameters, i.e., the network weights at the current time t, and θ_t^- denotes the target Q-network parameters used to calculate the target Q value. argmax_a Q(s_{t+1}, a; θ_t) denotes the action a_t with the highest Q value (the current maximum state action value) given the satellite-terrestrial converged network state parameter s_{t+1}, and Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ_t^-) is the current state action value obtained by selecting the satellite network node corresponding to the current maximum state action value under the current satellite network state parameter.
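A PyTorch sketch of this double-DQN target computation is given below; the network objects, batch shapes and the 0.99 discount default are assumptions.

```python
import torch

@torch.no_grad()
def ddqn_target(rewards, next_states, online_net, target_net, gamma: float = 0.99):
    """Y_t = r_{t+1} + gamma * Q(s_{t+1}, argmax_a Q(s_{t+1}, a; theta_t); theta_t^-):
    the online network selects the action, the target network evaluates it."""
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # from theta_t
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # from theta_t^-
    return rewards + gamma * next_q
```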
In some embodiments, as shown in fig. 2, in S1 an action is selected and performed according to the current state (i.e., a target satellite network node is selected to execute the communication service), the state at the next time (i.e., the next transmission demand state of the current transmission demand state) and a reward (i.e., a reward value) are obtained, and the experience is stored into the corresponding buffer according to the reward.
For example, in S1, a satellite network node selection action is selected and executed according to the current state, the state at the next time is obtained, and the general reward (i.e., the reward value) is calculated according to formula (1). Quadruples (s, a, r, s') whose reward value is greater than or equal to 0.5 (i.e., the preset reward value threshold) are stored in the successful playback buffer (i.e., the first storage area), and quadruples whose reward value is less than 0.5 are stored in the failed playback buffer (i.e., the third storage area), thereby obtaining a successful experience matrix composed of the stored quadruples.
Formula (1) is expressed as follows:

r_t(s_t, a_t) = b_{n,a_n} / (M * b_max)    (1)
In S2, the LSTM (long short-term memory network model) is trained using the successful experiences (i.e., the training experience samples stored in the first storage area); the LSTM is then used to process the failed experiences (i.e., the training experience samples stored in the third storage area), and the processed experiences (i.e., the predicted experience samples) are saved.
For example, in S2, when the number of experiences in the successful experience pool meets the required amount, the LSTM is trained using the successful experiences, and the failed network bandwidth decision experiences are then processed by the LSTM to obtain a predicted experience matrix (i.e., the predicted experience samples).
S3, collecting a mixed experience set (i.e., the combined target training sample) from the different experience pools, and training the DDQN network (i.e., the deep reinforcement learning model).
For example, S3: the priorities of the samples in the success experience pool (i.e., the first storage area) and the prediction experience pool (i.e., the second storage area) are calculated using formulas (2) and (3), and the high-priority experience samples are preferentially extracted until 500 playback experience samples have been extracted, resulting in a mixed experience matrix (i.e., the combined target training sample).
Wherein, formula (2) (the first priority, for the success experience pool) is expressed as follows:

$$p_1 = \frac{k - \mathrm{sim}_1}{k}\,\lvert\delta\rvert$$

Formula (3) (the second priority, for the prediction experience pool) is expressed as follows:

$$p_2 = \frac{k - \mathrm{sim}_2}{k}\,\lvert\delta\rvert$$

where k is the number of selected training experience samples, sim_1 and sim_2 are the first and second cosine similarities between the selected and unselected samples in the respective storage areas, and δ is the time difference error.
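The bodies of formulas (2) and (3) are reconstructed above from the priority definition used elsewhere in this application; the sketch below implements that reconstruction and should be read as an assumption about their precise form rather than a verbatim transcription.

```python
import numpy as np

def sample_priority(candidate, selected_samples, td_error):
    """Priority in the spirit of formulas (2)/(3):
    ((k - cos_sim) / k) * |TD error|, where k is the number of samples already
    selected from the pool and cos_sim is the cosine similarity between the
    candidate (an unselected sample) and the already-selected samples."""
    k = len(selected_samples)
    if k == 0:
        return abs(td_error)                       # nothing selected yet
    cand = np.asarray(candidate, dtype=float)
    sims = []
    for s in selected_samples:
        s = np.asarray(s, dtype=float)
        sims.append(np.dot(cand, s) /
                    (np.linalg.norm(cand) * np.linalg.norm(s) + 1e-8))
    cos_sim = float(np.mean(sims))                 # averaged over the selected set
    return (k - cos_sim) / k * abs(td_error)
```

Samples from the success pool and the prediction pool are scored in this way, and the highest-priority ones are taken until the 500-sample mixed experience matrix is filled.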
The DDQN network is trained using the mixed experience set and the latest experience set (i.e., the training experience samples corresponding to the current network environment state), and the ε-greedy strategy of formulas (5)-(8) is adopted to reduce the loss between the target Q value (i.e., the real state action value) and the predicted Q value (i.e., the predicted state action value), where σ is set to 1, the λ increment is set to 1.1, the initial threshold is set to 0.75, and the initial ε of every state is set to 0.5 (see the training sketch after formulas (4)-(8) below). The target Q value is finally obtained through formula (4) and is used to evaluate and select the action with the highest Q value (i.e., the star network node corresponding to the maximum state action value executes the communication service). Action selection is performed continuously according to the target Q value, and the network bandwidth resources are finally allocated reasonably, thereby optimizing the performance of the star-ground fusion network.
Wherein, formula (4) is the target Q value expression given above:

$$Y_t = r_{t+1} + \gamma\, Q\!\left(s_{t+1},\ \arg\max_{a} Q(s_{t+1}, a;\ \theta_t);\ \theta_t^{-}\right)$$
Formula (5) is expressed as follows:
Formula (6) is expressed as follows:
Formula (7) is expressed as follows:
Formula (8) is expressed as follows:
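Since the expressions of formulas (5)-(8) (the adaptive ε-greedy rules parameterized by σ = 1, the λ increment of 1.1, the threshold of 0.75, and the initial ε of 0.5) are not reproduced here, the sketch below substitutes a plain ε-greedy policy with the stated initial ε and minimizes a mean-squared loss between the predicted Q value and the target Q value of formula (4). ddqn_target refers to the earlier sketch; the optimizer handling and function names are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def select_action(online_net, state, num_nodes, epsilon=0.5):
    """ε-greedy selection: explore with probability ε, otherwise pick the
    star network node with the largest predicted state action value."""
    if random.random() < epsilon:
        return random.randrange(num_nodes)
    with torch.no_grad():
        return int(online_net(state.unsqueeze(0)).argmax(dim=1).item())

def train_step(online_net, target_net, optimizer, batch, gamma=0.9):
    """One DDQN update on a mini-batch drawn from the mixed and latest
    experience sets: minimize the loss between predicted and target Q values."""
    state, action, reward, next_state = batch            # batched tensors
    predicted_q = online_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    target_q = ddqn_target(reward, next_state, online_net, target_net, gamma)
    loss = F.mse_loss(predicted_q, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())
```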
It should be noted that the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiments may also be applied in a distributed scenario and completed by a plurality of devices cooperating with each other. In such a distributed scenario, one of the devices may perform only one or more steps of the method, and the devices interact with each other to accomplish the method.
It should be noted that the foregoing describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the application also provides a device for distributing bandwidth resources of a satellite communication network, which corresponds to the method of any embodiment.
Referring to fig. 3, the allocation apparatus of bandwidth resources of a satellite communication network includes:
A priority determining module 301 configured to determine a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, where the training experience samples represent execution parameters of a communication service;
A sample selection module 302 configured to select training experience samples from among the training experience samples of the first storage area according to the first priority, and to select training experience samples from among the training experience samples of the second storage area according to the second priority;
A sample combination module 303, configured to obtain training experience samples corresponding to the current network environment state, and combine the training experience samples selected in the first storage area, the training experience samples selected in the second storage area, and the training experience samples corresponding to the current network environment state to obtain a combined target training sample;
A training module 304 configured to train the pre-constructed deep reinforcement learning model by using the combined target training sample to obtain a trained deep reinforcement learning model;
A value determining module 305, configured to obtain real-time network environment state information, and based on the real-time network environment state information, determine state action values obtained by selecting each star network node to execute a communication service by using the trained deep reinforcement learning model, respectively;
The resource allocation module 306 is configured to determine a maximum state action value of the state action values corresponding to the respective star network nodes, obtain a target network bandwidth occupation value when the star network node corresponding to the maximum state action value executes the communication service, and allocate the satellite communication network bandwidth resource to the star network node corresponding to the maximum state action value according to the target network bandwidth occupation value.
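A sketch of how the value determining module 305 and the resource allocation module 306 could be chained at inference time; the accessor bandwidth_occupancy_of (returning the target network bandwidth occupation value of a node) is a placeholder, not an interface defined by the application.

```python
import torch

def allocate_bandwidth(trained_net, state, bandwidth_occupancy_of):
    """Select the star network node with the maximum state action value and
    allocate satellite network bandwidth according to its target network
    bandwidth occupation value."""
    with torch.no_grad():
        q_values = trained_net(state.unsqueeze(0)).squeeze(0)  # one Q value per node
    best_node = int(q_values.argmax().item())
    target_occupancy = bandwidth_occupancy_of(best_node)       # placeholder accessor
    return best_node, target_occupancy
```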
In some embodiments, the allocation apparatus of satellite communication network bandwidth resources further comprises a storage module, which specifically includes:
a first acquisition unit configured to acquire a current transmission demand state of any one of a plurality of user terminals;
A second obtaining unit configured to select a target star network node from a plurality of star network nodes based on the current transmission demand state to perform a communication service, and obtain a network bandwidth limit value of a transmission channel between the any one user terminal and the target star network node, a network bandwidth occupation value when the communication service is performed, and a next transmission demand state of the current transmission demand state;
A reward value determining unit configured to determine a reward value using the network bandwidth limit value and the network bandwidth occupation value;
A first storage unit configured to store the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value, and the next transmission demand state to the first storage area or a preset third storage area according to the reward value;
A training unit configured to train a pre-constructed long-term and short-term memory network model by using the training experience samples stored in the first storage area to obtain a trained long-term and short-term memory network model;
And the second storage unit is configured to input the training experience sample stored in the third storage area into a trained long-term and short-term memory network model for prediction to obtain a prediction experience sample, and store the prediction experience sample into the second storage area.
In some embodiments, the reward value determining unit is specifically configured to:
Acquiring the number of the star network nodes;
Performing product processing by using the number of the star network nodes and the network bandwidth limit value to obtain a first product processing result;
and carrying out ratio processing on the network bandwidth occupation value and the first product processing result to obtain the reward value.
In some embodiments, the first storage unit is specifically configured to:
Judging whether the reward value is greater than or equal to a preset reward value threshold, and obtaining a judgment result;
In response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area; or alternatively
In response to the judgment result being no, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the third storage area.
In some embodiments, the priority determination module 301 is specifically configured to determine the first priority of each training experience sample stored in the first storage area by:
determining a first cosine similarity between the selected training experience samples in the first storage area and the unselected training experience samples in the first storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result;
Performing ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result;
And carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
In some embodiments, the priority determination module 301 is further specifically configured to determine the second priority of each training experience sample stored in the second storage area by:
determining a second cosine similarity between the selected training experience samples in the second storage area and the unselected training experience samples in the second storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result;
performing ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result;
And carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
In some embodiments, training module 304 is specifically configured to:
acquiring a real state action value corresponding to the combined target training sample;
inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting a predicted state action value through the pre-built deep reinforcement learning model;
And constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
In some embodiments, the real-time network environment status information includes: current star network state parameters;
The value determination module 305 is specifically configured to:
Acquiring an instant rewarding value and a discount factor obtained by executing any star network node under the current star network state parameter, and selecting a current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter;
performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result;
And adding the second product processing result and the instant rewarding value to obtain a state action value obtained by selecting each star network node to execute communication service.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The device of the foregoing embodiment is configured to implement the corresponding method for allocating bandwidth resources of a satellite communication network in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the application also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method for allocating bandwidth resources of the satellite communication network according to any embodiment when executing the program.
Fig. 4 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 401, a memory 402, an input/output interface 403, a communication interface 404, and a bus 405. Wherein the processor 401, the memory 402, the input/output interface 403 and the communication interface 404 are in communication connection with each other inside the device via a bus 405.
The processor 401 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, etc., for executing related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 402 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage, dynamic storage, etc. The memory 402 may store an operating system and other application programs; when the solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 402 and invoked for execution by the processor 401.
The input/output interface 403 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 404 is used to connect a communication module (not shown in the figure) to enable communication interaction between the present device and other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 405 includes a path to transfer information between components of the device (e.g., processor 401, memory 402, input/output interface 403, and communication interface 404).
It should be noted that, although the above device only shows the processor 401, the memory 402, the input/output interface 403, the communication interface 404, and the bus 405, in the implementation, the device may further include other components necessary for realizing normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the method for allocating bandwidth resources of a satellite communication network according to any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the present application also provides a non-transitory computer readable storage medium corresponding to the method of any embodiment, wherein the non-transitory computer readable storage medium stores computer instructions for causing the computer to execute the method for allocating bandwidth resources of a satellite communication network according to any embodiment.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to execute the method for allocating bandwidth resources of a satellite communication network according to any one of the foregoing embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the application as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory architectures (e.g., dynamic RAM (DRAM)).
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like, which are within the spirit and principles of the embodiments of the application, are intended to be included within the scope of the application.

Claims (10)

1. A method for allocating bandwidth resources of a satellite communication network, comprising:
Determining a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
selecting training experience samples from the training experience samples in the first storage area according to the first priority, and selecting training experience samples from the training experience samples in the second storage area according to the second priority;
acquiring an experience sample for training corresponding to the current network environment state, and combining the experience sample for training selected in the first storage area, the experience sample for training selected in the second storage area and the experience sample for training corresponding to the current network environment state to obtain a sample for combined target training;
Training a pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
acquiring real-time network environment state information, and respectively determining state action values obtained by selecting each star network node to execute communication service by using the trained deep reinforcement learning model based on the real-time network environment state information;
Determining the maximum state action value in the state action values corresponding to all the star network nodes, acquiring a target network bandwidth occupation value when the star network node corresponding to the maximum state action value executes communication service, and distributing satellite communication network bandwidth resources to the star network node corresponding to the maximum state action value according to the target network bandwidth occupation value.
2. The method of claim 1, wherein prior to determining the first priority of each training experience sample stored in the preset first storage area and the second priority of each training experience sample stored in the preset second storage area, the method further comprises:
acquiring the current transmission demand state of any one of a plurality of user terminals;
Selecting a target star network node from a plurality of star network nodes to execute communication service based on the current transmission demand state, and acquiring a network bandwidth limit value of a transmission channel between any user terminal and the target star network node, a network bandwidth occupation value when executing the communication service and a next transmission demand state of the current transmission demand state;
Determining a reward value using the network bandwidth limit value and the network bandwidth occupation value;
Storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state to the first storage area or a preset third storage area according to the reward value;
Training a pre-constructed long-term and short-term memory network model by using the training experience sample stored in the first storage area to obtain a trained long-term and short-term memory network model;
And inputting the training experience sample stored in the third storage area into a trained long-term and short-term memory network model for prediction to obtain a prediction experience sample, and storing the prediction experience sample into the second storage area.
3. The method of claim 2, wherein said determining a reward value using said network bandwidth limit value and said network bandwidth occupation value comprises:
Acquiring the number of the star network nodes;
Performing product processing by using the number of the star network nodes and the network bandwidth limit value to obtain a first product processing result;
and carrying out ratio processing on the network bandwidth occupation value and the first product processing result to obtain the reward value.
4. The method according to claim 2, wherein the storing, according to the reward value, the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value, and the next transmission demand state into the first storage area or a preset third storage area comprises:
Judging whether the reward value is greater than or equal to a preset reward value threshold, and obtaining a judgment result;
In response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area; or alternatively
In response to the judgment result being no, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the third storage area.
5. The method of claim 1, wherein determining the first priority of each training experience sample stored in the preset first storage area comprises:
determining a first cosine similarity between the selected training experience samples in the first storage area and the unselected training experience samples in the first storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result;
Performing ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result;
And carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
6. The method of claim 1, wherein determining the second priority of each training experience sample stored in the preset second storage area comprises:
Determining a second cosine similarity between the selected training experience samples in the second storage area and the unselected training experience samples in the second storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result;
performing ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result;
And carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
7. The method of claim 1, wherein training the pre-constructed deep reinforcement learning model using the combined target training sample to obtain a trained deep reinforcement learning model comprises:
acquiring a real state action value corresponding to the combined target training sample;
inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting a predicted state action value through the pre-built deep reinforcement learning model;
And constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
8. The method of claim 1, wherein the real-time network environment status information comprises: current star network state parameters;
Based on the real-time network environment state information, the state action value obtained by selecting each star network node to execute the communication service is respectively determined by utilizing the trained deep reinforcement learning model, and the method comprises the following steps:
Acquiring an instant rewarding value and a discount factor obtained by executing any star network node under the current star network state parameter, and selecting a current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter;
performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result;
And adding the second product processing result and the instant rewarding value to obtain a state action value obtained by selecting each star network node to execute communication service.
9. An apparatus for allocating bandwidth resources of a satellite communication network, comprising:
A priority determining module configured to determine a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
A sample selection module configured to select training experience samples from among the training experience samples of the first storage area according to the first priority, and to select training experience samples from among the training experience samples of the second storage area according to the second priority;
The sample combination module is configured to acquire training experience samples corresponding to the current network environment state, and combine the training experience samples selected in the first storage area, the training experience samples selected in the second storage area and the training experience samples corresponding to the current network environment state to acquire a combined target training sample;
The training module is configured to train the pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
The value determining module is configured to acquire real-time network environment state information, and based on the real-time network environment state information, respectively determine state action values obtained by selecting each star network node to execute communication service by utilizing the trained deep reinforcement learning model;
The resource allocation module is configured to determine the largest state action value in the state action values corresponding to the star network nodes, acquire a target network bandwidth occupation value when the star network node corresponding to the largest state action value executes communication service, and allocate satellite communication network bandwidth resources to the star network node corresponding to the largest state action value according to the target network bandwidth occupation value.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when the program is executed by the processor.
CN202410182515.1A 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment Pending CN118282471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410182515.1A CN118282471A (en) 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410182515.1A CN118282471A (en) 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment

Publications (1)

Publication Number Publication Date
CN118282471A true CN118282471A (en) 2024-07-02

Family

ID=91637596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410182515.1A Pending CN118282471A (en) 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment

Country Status (1)

Country Link
CN (1) CN118282471A (en)


Legal Events

Date Code Title Description
PB01 Publication