CN118282471A - Method for allocating bandwidth resources of a satellite communication network and related equipment

Publication number: CN118282471A
Application number: CN202410182515.1A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Inventors: 欧清海, 姜燕, 蒋月
Applicant/Assignee: Beijing Zhongdian Feihua Communication Co., Ltd.
Prior art keywords: training, value, state, storage area, samples
Classification: Data Exchanges in Wide-Area Networks
Abstract

The application provides a method for allocating bandwidth resources of a satellite communication network and related equipment. By distinguishing the importance of the samples stored in different storage areas, the training efficiency and accuracy of a deep reinforcement learning model are improved. Because training experience samples corresponding to the current network environment state are added to the combined target training samples, the strategy learned by the deep reinforcement learning model is prevented from becoming inapplicable to the current network environment, which further improves the accuracy of the model. Satellite communication network bandwidth resources are then allocated, according to the target network bandwidth occupation value, to the satellite network node corresponding to the maximum state action value. This solves the problem that fixed bandwidth allocation cannot adapt to the dynamic changes of a satellite network and the flexible requirements of different communication services, and guarantees the full utilization and performance of the bandwidth resources.

Description

Method for allocating bandwidth resources of a satellite communication network and related equipment
Technical Field
The present application relates to the field of communications technologies, and in particular, to a method and related device for allocating bandwidth resources of a satellite communications network.
Background
In the context of power system operation, satellite communication can provide wide-area coverage, enabling remote power plants and distributed energy facilities to access the power system without obstruction and to transmit critical information, such as energy production data, in real time. Because the bandwidth resources of the satellite-terrestrial converged network are limited while diversified communication service requirements must be met, the allocation of bandwidth resources has to be more refined and stricter. For such a network, a better bandwidth resource allocation scheme can significantly improve the efficiency and utilization of the communication system, optimize network performance, and improve the service experience.
However, the conventional network bandwidth resource allocation strategy uses fixed bandwidth allocation, cannot adapt to the dynamic changes of a satellite network or the flexible requirements of different communication services, and easily leads to wasted bandwidth resources and poor performance.
Disclosure of Invention
Accordingly, an objective of the present application is to provide a method and related equipment for allocating bandwidth resources of a satellite communication network, which are used for solving or partially solving the above-mentioned technical problems.
Based on the above object, a first aspect of the present application provides a method for allocating bandwidth resources of a satellite communication network, including:
Determining a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
selecting training experience samples from the training experience samples in the first storage area according to the first priority, and selecting training experience samples from the training experience samples in the second storage area according to the second priority;
Acquiring a training experience sample corresponding to the current network environment state, and combining the training experience sample selected in the first storage area, the training experience sample selected in the second storage area and the training experience sample corresponding to the current network environment state to obtain a combined target training sample;
Training a pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
acquiring real-time network environment state information, and respectively determining state action values obtained by selecting each star network node to execute communication service by using the trained deep reinforcement learning model based on the real-time network environment state information;
Determining the maximum state action value in the state action values corresponding to all the star network nodes, acquiring a target network bandwidth occupation value when the star network node corresponding to the maximum state action value executes communication service, and distributing satellite communication network bandwidth resources to the star network node corresponding to the maximum state action value according to the target network bandwidth occupation value.
Optionally, before determining the first priority of each training experience sample stored in the preset first storage area and the second priority of each training experience sample stored in the preset second storage area, the method further includes:
acquiring the current transmission demand state of any one of a plurality of user terminals;
Selecting a target star network node from a plurality of star network nodes to execute communication service based on the current transmission demand state, and acquiring a network bandwidth limit value of a transmission channel between any user terminal and the target star network node, a network bandwidth occupation value when executing the communication service and a next transmission demand state of the current transmission demand state;
Determining a prize value using the network bandwidth limit and the network bandwidth occupancy value;
Storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state to the first storage area or a preset third storage area according to the reward value;
Training a pre-constructed long short-term memory (LSTM) network model by using the training experience samples stored in the first storage area to obtain a trained long short-term memory network model;
and inputting the training experience samples stored in the third storage area into the trained long short-term memory network model for prediction to obtain prediction experience samples, and storing the prediction experience samples into the second storage area.
Optionally, the determining the prize value using the network bandwidth limit and the network bandwidth occupancy value includes:
Acquiring the number of the star network nodes;
Performing product processing by using the number of the star network nodes and the network bandwidth limit value to obtain a first product processing result;
and carrying out ratio processing on the network bandwidth occupation value and the first product processing result to obtain the rewarding value.
Optionally, the storing the training experience sample formed by combining the current transmission requirement state, the target star network node, the reward value and the next transmission requirement state according to the reward value in the first storage area or a preset third storage area includes:
Judging whether the rewarding value is larger than or equal to a preset rewarding value threshold value or not, and obtaining a judging result;
In response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target satellite network node, the reward value and the next transmission demand state into the first storage area; or
in response to the judgment result being no, storing the training experience sample formed by combining the current transmission demand state, the target satellite network node, the reward value and the next transmission demand state into the third storage area.
Optionally, the determining the first priority of each training experience sample stored in the preset first storage area includes:
determining a first cosine similarity between the selected training empirical samples in the first storage area and the unselected training empirical samples in the first storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result;
Performing ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result;
And carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
Optionally, the determining the second priority of each training experience sample stored in the preset second storage area includes:
Determining a second cosine similarity between the selected training empirical samples in the second storage area and the unselected training empirical samples in the second storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result;
performing ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result;
And carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
Optionally, training the pre-constructed deep reinforcement learning model by using the combined target training sample to obtain a trained deep reinforcement learning model, including:
acquiring a real state action value corresponding to the combined target training sample;
inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting a predicted state action value through the pre-built deep reinforcement learning model;
And constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
Optionally, the real-time network environment status information includes: current star network state parameters;
Based on the real-time network environment state information, the state action value obtained by selecting each star network node to execute the communication service is respectively determined by utilizing the trained deep reinforcement learning model, and the method comprises the following steps:
Acquiring an instant rewarding value and a discount factor obtained by executing any star network node under the current star network state parameter, and selecting a current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter;
performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result;
And adding the second product processing result and the instant rewarding value to obtain a state action value obtained by selecting each star network node to execute communication service.
Based on the same inventive concept, a second aspect of the present application provides an allocation apparatus of bandwidth resources of a satellite communication network, including:
A priority determining module configured to determine a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
A sample selection module configured to select training experience samples from among the training experience samples of the first storage area according to the first priority, and to select training experience samples from among the training experience samples of the second storage area according to the second priority;
The sample combination module is configured to acquire training experience samples corresponding to the current network environment state, and combine the training experience samples selected in the first storage area, the training experience samples selected in the second storage area and the training experience samples corresponding to the current network environment state to acquire a combined target training sample;
The training module is configured to train the pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
The value determining module is configured to acquire real-time network environment state information, and based on the real-time network environment state information, respectively determine state action values obtained by selecting each star network node to execute communication service by utilizing the trained deep reinforcement learning model;
The resource allocation module is configured to determine the largest state action value in the state action values corresponding to the star network nodes, acquire a target network bandwidth occupation value when the star network node corresponding to the largest state action value executes communication service, and allocate satellite communication network bandwidth resources to the star network node corresponding to the largest state action value according to the target network bandwidth occupation value.
A third aspect of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
As can be seen from the above, in the method and related equipment for allocating bandwidth resources of a satellite communication network provided by the application, training experience samples are selected from the first storage area according to the first priority and from the second storage area according to the second priority, which fully considers and distinguishes the importance of the samples in different storage areas, thereby improving the training efficiency and accuracy of the deep reinforcement learning model; the selected samples are then combined to form the combined target training sample that serves as the training input of the deep reinforcement learning model.
Drawings
In order to more clearly illustrate the technical solutions of the present application or related art, the drawings that are required to be used in the description of the embodiments or related art will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow chart of a method for allocating bandwidth resources of a satellite communication network according to an embodiment of the present application;
fig. 2 is a schematic diagram of a flow chart for allocating bandwidth resources of a satellite communication network according to an embodiment of the present application;
FIG. 3 is a block diagram illustrating a configuration of an apparatus for allocating bandwidth resources of a satellite communication network according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an electronic device according to an embodiment of the application.
Detailed Description
The present application will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present application more apparent.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application shall have the ordinary meaning understood by a person of ordinary skill in the art to which the present application belongs. The terms "first", "second" and the like used in the embodiments of the present application do not denote any order, quantity or importance, but are merely used to distinguish different components. The word "comprising", "comprises" or the like means that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected", "coupled" and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
It will be appreciated that before using the technical solutions of the embodiments of the present application, the user is informed of the type, the range of use, the use scenario, etc. of the related personal information in an appropriate manner, and the authorization of the user is obtained.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly prompt the user that the operation it is requesting to perform will require personal information to be obtained and used with the user. Therefore, the user can select whether to provide personal information to the software or hardware such as the electronic equipment, the application program, the server or the storage medium for executing the operation of the technical scheme according to the prompt information.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization acquisition process is merely illustrative, and not limiting of the implementation of the present application, and that other ways of satisfying relevant legal regulations may be applied to the implementation of the present application.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Bandwidth allocation in a satellite-terrestrial converged network refers to the process of efficiently allocating and managing bandwidth to meet the demands of communication services. In the context of power system operation, satellite communication can provide wide-area coverage, enabling remote power plants and distributed energy facilities to access the power system without obstruction and to transmit critical information, such as energy production data, in real time. For the satellite-terrestrial converged network, a better network bandwidth resource allocation scheme can significantly improve the efficiency and utilization of the communication system, optimize network performance, and improve the service experience.
Because the bandwidth resources of the satellite-terrestrial converged network are limited while diversified communication service requirements must be met, the allocation of bandwidth resources has to be more refined and stricter.
However, the conventional network bandwidth resource allocation strategy uses fixed bandwidth allocation, cannot adapt to the dynamic changes of a satellite network or the flexible requirements of different communication services, and easily leads to wasted bandwidth resources and poor performance.
In the method for allocating bandwidth resources of a satellite communication network provided by the embodiments of the application, training experience samples are selected from the first storage area according to the first priority and from the second storage area according to the second priority. The importance of samples in different storage areas is thus fully considered and distinguished, which improves the training efficiency and accuracy of the deep reinforcement learning model. The selected samples are combined into a combined target training sample that serves as the training input of the deep reinforcement learning model; because training experience samples corresponding to the current network environment state are added to the combined target training sample, the strategy learned by the deep reinforcement learning model is prevented from becoming inapplicable to the current network environment, the accuracy of the model is improved, and the state action values it determines are more accurate. Satellite communication network bandwidth resources are then allocated, according to the target network bandwidth occupation value, to the satellite network node corresponding to the largest state action value. This avoids the problem that fixed bandwidth allocation cannot adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services, and ensures that the bandwidth resources are fully utilized.
As shown in fig. 1, the method of the present embodiment includes:
Step 101, determining a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services.
In this step, the conventional approach employs uniform random sampling to select the empirical samples for training the deep reinforcement learning model, enabling the deep reinforcement learning model to converge faster, but without taking into account the difference in useful information provided by the different empirical samples.
The method fully considers the importance of the samples in different storage areas, distinguishes the importance of the samples in different storage areas, and improves the training efficiency and the training precision of the deep reinforcement learning model by determining the first priority of each training experience sample stored in a preset first storage area and the second priority of each training experience sample stored in a preset second storage area.
Step 102, selecting training experience samples from the training experience samples in the first storage area according to the first priority, and selecting training experience samples from the training experience samples in the second storage area according to the second priority.
In this step, training empirical samples are selected from the first storage area based on the first priority, and training empirical samples are selected from the second storage area using the second priority such that the selected empirical samples contain as much useful information as possible, thereby reducing the number of states that the deep reinforcement learning model must explore or utilize, helping the deep reinforcement learning model to converge quickly.
And 103, acquiring training experience samples corresponding to the current network environment state, and combining the training experience samples selected in the first storage area, the training experience samples selected in the second storage area and the training experience samples corresponding to the current network environment state to obtain a combined target training sample.
In the step, because the training experience sample corresponding to the current network environment state is added in the combined target training sample, the combined target training sample is used as the input of the deep reinforcement learning model training, so that the condition that the deep reinforcement learning model learning strategy is not suitable for the current network environment can be avoided, and the precision of the deep reinforcement learning model can be improved.
And 104, training a pre-constructed deep reinforcement learning model by using the combined target training sample to obtain a trained deep reinforcement learning model.
In this step, the trained deep reinforcement learning model may be a Deep Q-Network (DQN) model, a Double Deep Q-Network (DDQN) model, or a Deep Deterministic Policy Gradient (DDPG) model; the DDQN model is preferred.
The deep reinforcement learning model combines a neural network with a reinforcement learning algorithm to solve decision problems with delayed rewards, so that the model can learn autonomously from the environment and make decisions according to the learned experience.
During training, deep reinforcement learning improves itself by interacting with the environment: it observes the state of the environment, selects and performs appropriate actions, and then receives a reward signal from the environment as feedback. The goal is to learn a strategy that maximizes the long-term cumulative reward. For adaptive network resource allocation, the network bandwidth resource allocation problem is modeled as a reinforcement learning problem, and an optimal allocation strategy can be learned by interacting with the environment. Deep reinforcement learning can model complex environments and tasks and learn an automatically optimized strategy, which makes it very suitable for solving the adaptive bandwidth resource allocation problem in a satellite-terrestrial converged network.
In addition, the importance of the samples in different storage areas is fully considered by the combined target training sample, so that the training efficiency and the training precision of the deep reinforcement learning model are improved, the combined target training sample is formed in a combined mode to serve as the input of the deep reinforcement learning model training, and the training experience sample corresponding to the current network environment state is added into the combined target training sample, so that the situation that the deep reinforcement learning model learning strategy is not suitable for the current network environment can be avoided, and the precision of the deep reinforcement learning model can be improved.
Step 105, acquiring real-time network environment state information, and based on the real-time network environment state information, respectively determining state action values obtained by selecting each star network node to execute communication service by using the trained deep reinforcement learning model.
In the step, the trained deep reinforcement learning model can be suitable for real-time network environments at different moments, so that the state action value obtained by selecting each star network node to execute communication service can be more accurate by utilizing the trained deep reinforcement learning model.
Step 106, determining the largest state action value in the state action values corresponding to all the star network nodes, obtaining a target network bandwidth occupation value when the star network node corresponding to the largest state action value executes communication service, and distributing satellite communication network bandwidth resources to the star network node corresponding to the largest state action value according to the target network bandwidth occupation value.
In this step, the state action values determined by the deep reinforcement learning model are more accurate, and satellite communication network bandwidth resources are allocated to the satellite network node corresponding to the maximum state action value according to the target network bandwidth occupation value, so as to adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services. In addition, the full utilization and performance of the bandwidth resources can be guaranteed, which solves the problem that conventional fixed bandwidth allocation cannot adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services.
In the above scheme, training experience samples are selected from the first storage area according to the first priority and from the second storage area according to the second priority, so that the importance of samples in different storage areas is fully considered and distinguished, improving the training efficiency and accuracy of the deep reinforcement learning model. The selected samples are combined into a combined target training sample that serves as the training input of the deep reinforcement learning model; because training experience samples corresponding to the current network environment state are added to the combined target training sample, the strategy learned by the deep reinforcement learning model is prevented from becoming inapplicable to the current network environment, the accuracy of the model is improved, and the state action values it determines are more accurate. Satellite communication network bandwidth resources are further allocated, according to the target network bandwidth occupation value, to the satellite network node corresponding to the largest state action value, which avoids the problem that fixed bandwidth allocation cannot adapt to the dynamic changes of the satellite network and the flexible requirements of different communication services, and guarantees the full utilization and performance of the bandwidth resources.
In some embodiments, prior to step 101, the method further comprises:
Step A1, a current transmission demand state of any one of a plurality of user terminals is obtained.
And step A2, selecting a target star network node from a plurality of star network nodes to execute communication service based on the current transmission demand state, and acquiring a network bandwidth limit value of a transmission channel between any user terminal and the target star network node, a network bandwidth occupation value when executing the communication service and a next transmission demand state of the current transmission demand state.
And A3, determining a reward value by utilizing the network bandwidth limit value and the network bandwidth occupation value.
And step A4, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area or a preset third storage area according to the reward value.
And step A5, training the pre-constructed long-short-period memory network model by using the training experience sample stored in the first storage area to obtain a trained long-short-period memory network model.
And step A6, inputting the training experience sample stored in the third storage area into a trained long-term and short-term memory network model for prediction to obtain a prediction experience sample, and storing the prediction experience sample into the second storage area.
In this scheme, the satellite-terrestrial converged network is modeled to obtain a set N of user terminals, a set M of satellite network nodes and an action space A_n. In each time step, the current transmission demand state s is acquired from the network environment, and an action a is selected based on this state, representing the selection of a certain satellite network node to carry the communication service. This action is performed in the satellite-terrestrial converged network, which then returns the next transmission demand state s' and the reward value of this decision. At this point, whether the action selection was successful is judged according to the reward value; if so, the quadruple (s, a, r, s') is placed in the successful playback buffer (i.e., the first storage area), and if not, it is placed in the failed playback buffer (i.e., the third storage area).
Specifically, the satellite-terrestrial converged network under consideration is modeled as follows. The set of user terminals is N = {1, 2, ..., N}, and the set of satellite network nodes is M = {1, 2, ..., M}. The action space of any user n is A_n = {0, 1, 2, ..., M}, that is, the user can select any node in the satellite network node set M for task data transmission. The state space s is represented by the current transmission demand state of each user, where 0 indicates that the user currently has no transmission demand and 1 indicates that it does. Each time user n transmits task data to a satellite network node m, a certain bandwidth is occupied, denoted b_nm, and the maximum bandwidth limit (i.e., the network bandwidth limit) of each transmission channel is b_max. Each user terminal corresponds to at most one transmission task at a time.
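To make this modeling concrete, the following minimal sketch represents the user set N, the satellite node set M, the per-channel bandwidth limit b_max, the binary transmission-demand state and the per-user action space. All names and sizes (NUM_USERS, NUM_NODES, B_MAX, the random bandwidth matrix) are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

# Illustrative sketch of the modeling above; sizes and values are assumed.
NUM_USERS = 8      # |N|: user terminals n = 1..N
NUM_NODES = 4      # |M|: satellite network nodes m = 1..M
B_MAX = 10.0       # maximum bandwidth of each transmission channel (b_max)

rng = np.random.default_rng(0)

# State s: one bit per user terminal (1 = has a transmission demand, 0 = no demand).
state = rng.integers(0, 2, size=NUM_USERS)

# Action space of user n: A_n = {0, 1, ..., M}; 0 means no transmission.
action_space = np.arange(NUM_NODES + 1)

# b[n, m]: bandwidth occupied when user n transmits task data to node m, capped at b_max.
b = np.minimum(rng.uniform(0.5, 1.2 * B_MAX, size=(NUM_USERS, NUM_NODES)), B_MAX)
```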
After a user terminal selects a satellite network node to execute the communication service at time t, a general reward (i.e., a reward value) r_t(s_t, a_t) for the satellite network is obtained.
And determining whether the action of the current round is successful or not according to the universal rewards of the bandwidth allocation decision of each round of the user. Successful experiences are stored in corresponding successful experience buffers, and failed experiences are temporarily stored in failed experience buffers.
When a user has accumulated a certain amount of successful experience in executing bandwidth resource allocation decisions for the satellite-terrestrial converged network, the LSTM network (i.e., the long short-term memory network model) is trained using the successful experience (i.e., the training experience samples stored in the first storage area). The LSTM network is then used to process the user's failed experience (i.e., the training experience samples stored in the third storage area), and the processed experience (i.e., the predicted experience samples) is stored in the predicted experience buffer (i.e., the second storage area), so as to extract potential patterns and make maximum use of the user experience data.
Specifically, successful experience is utilized to train the LSTM network, so that the LSTM network can learn and predict bandwidth allocation decision data in a star-ground fusion network environment.
The failed experience is passed to the network for prediction. The result of the prediction is stored in a predicted playback buffer for subsequent sampling.
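A minimal PyTorch sketch of this success/failure workflow is given below, assuming the LSTM maps one flattened (s, a, r, s') tuple to one predicted tuple. The layer sizes, the reconstruction-style training objective and the flattening of tuples into fixed-length vectors are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn

class ExperienceLSTM(nn.Module):
    """Maps a flattened experience tuple to a predicted experience tuple."""
    def __init__(self, tuple_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(tuple_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, tuple_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # x: (batch, seq_len, tuple_dim)
        return self.head(out[:, -1])   # predicted experience tuple

def train_on_successes(model: ExperienceLSTM, success_batch: torch.Tensor, epochs: int = 10):
    """Fit the LSTM on successful experiences (assumed reconstruction objective)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        pred = model(success_batch.unsqueeze(1))   # each tuple treated as a length-1 sequence
        loss = nn.functional.mse_loss(pred, success_batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

def predict_from_failures(model: ExperienceLSTM, failed_batch: torch.Tensor) -> torch.Tensor:
    """Turn failed experiences into predicted experiences for the predicted-experience buffer."""
    with torch.no_grad():
        return model(failed_batch.unsqueeze(1))
```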
When a certain amount of effective experience is accumulated, samples are uniformly and randomly extracted from different experience pools according to preset weights. The sampled experience is input into DDQN networks (i.e., deep reinforcement learning models) to calculate Q values (i.e., state action values) and loss is calculated to update the networks until the main network converges. And optimizing the network bandwidth resource allocation decision according to the obtained continuously optimized DDQN target network so as to realize the high-efficiency self-adaptive allocation of the star-ground fusion network bandwidth and improve the overall performance of the star network.
Specifically, experience samples from different sources are mixed as the input of the neural network (i.e., the deep reinforcement learning model), finally yielding a mixed experience set E (i.e., the combined target training sample). The sample sources include samples E_0 from the successful experience pool (i.e., the training experience samples selected in the first storage area), samples E_1 from the predicted experience pool (i.e., the training experience samples selected in the second storage area), and the most recent samples E_2 from the current satellite network environment (i.e., the training experience samples corresponding to the current network environment state). The purpose of adding the samples generated by the latest interaction with the environment is to include, for learning, the samples closest to the current state of the satellite network, thereby preventing the learned strategy from being inapplicable to the current network environment.
The DDQN network is trained using the data set obtained after mixed experience sampling (i.e., the combined target training samples) to obtain the bandwidth resource allocation decision for the satellite-terrestrial converged network.
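One way to assemble the mixed experience set E from the three sources is sketched below. The 300/150/50 split and the list-based interface are illustrative assumptions; the patent only says the pools are sampled according to preset weights.

```python
def build_mixed_batch(success_pool, predicted_pool, recent_pool,
                      n_success=300, n_predicted=150, n_recent=50):
    """Combine E0 (successful experiences), E1 (predicted experiences) and E2 (the
    latest interactions with the satellite network) into one DDQN training batch.
    The first two pools are assumed to be ordered by the priority rule described below."""
    e0 = list(success_pool)[:n_success]
    e1 = list(predicted_pool)[:n_predicted]
    e2 = list(recent_pool)[-n_recent:]
    return e0 + e1 + e2
```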
In some embodiments, step A3 comprises:
And step A31, acquiring the number of the satellite network nodes.
And step A32, performing product processing using the number of the satellite network nodes and the network bandwidth limit value to obtain a first product processing result.
And step A33, performing ratio processing on the network bandwidth occupation value and the first product processing result to obtain the reward value.
In the above scheme, the number of satellite network nodes M and the maximum bandwidth of each transmission channel (i.e., the network bandwidth limit) b_max are multiplied to obtain the first product result M * b_max.
The bandwidth b_{n,a_n} occupied when user terminal n transmits task data to satellite network node a_n is then divided by the first product result M * b_max to obtain the reward value r_t(s_t, a_t).
This can be expressed as follows:

r_t(s_t, a_t) = b_{n,a_n} / (M * b_max)    (1)

Jointly considering the number of satellite network nodes and the network bandwidth limit value makes the determined reward value more accurate.
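Under the definitions above, the reward reduces to a single ratio; the sketch below uses illustrative function and parameter names.

```python
def reward_value(b_occupied: float, num_nodes: int, b_max: float) -> float:
    """r_t(s_t, a_t) = b_{n,a_n} / (M * b_max): the bandwidth occupied by the chosen
    transmission, normalised by the total bandwidth budget of all M channels."""
    return b_occupied / (num_nodes * b_max)

# Example: 6.0 units occupied, M = 4 satellite nodes, b_max = 10.0  ->  reward 0.15
print(reward_value(6.0, 4, 10.0))
```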
In some embodiments, step A4 comprises:
and step A41, judging whether the rewarding value is larger than or equal to a preset rewarding value threshold value, and obtaining a judging result.
And step A42, in response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the rewards value and the next transmission demand state into the first storage area. Or alternatively
And step A43, in response to the judging result being no, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the rewarding value and the next transmission demand state in the third storage area.
In the above scheme, when the reward value is greater than or equal to the preset reward value threshold, indicating that the action selection is successful, storing a training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area.
And when the rewarding value is smaller than a preset rewarding threshold value, indicating that the action selection is not effective, and storing a training experience sample formed by combining the current transmission demand state, the target star network node, the rewarding value and the next transmission demand state into a third storage area.
For example, the prize value threshold is 0.5, when the prize value is greater than or equal to 0.5, the quadruple (s, a, r, s') is stored in the successful playback buffer (i.e., the first storage region), and the quadruple with the prize value less than 0.5 is stored in the failed playback buffer (i.e., the third storage region).
The quadruple is a training experience sample formed by combining the current transmission demand state, the target star network node, the rewarding value and the next transmission demand state.
Distinguishing the importance of the samples according to the reward value in this way improves the sample utilization rate and the training efficiency.
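A small sketch of this routing rule follows; the buffer capacities are assumed, and the 0.5 threshold follows the example above.

```python
from collections import deque

success_buffer = deque(maxlen=10_000)   # first storage area (successful experiences)
failure_buffer = deque(maxlen=10_000)   # third storage area (failed experiences)
REWARD_THRESHOLD = 0.5                  # preset reward value threshold from the example

def store_experience(s, a, r, s_next):
    """Route the quadruple (s, a, r, s') by comparing its reward to the threshold."""
    if r >= REWARD_THRESHOLD:
        success_buffer.append((s, a, r, s_next))
    else:
        failure_buffer.append((s, a, r, s_next))
```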
In some embodiments, in step 101, the determining the first priority of each training experience sample stored in the preset first storage area includes:
and B1, determining a first cosine similarity between the selected training experience samples in the first storage area and the unselected training experience samples in the first storage area.
And B2, acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is the difference between the reward value corresponding to the current transmission demand state of any one of the plurality of user terminals and the reward value corresponding to the next transmission demand state of the current transmission demand state.
And B3, carrying out difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result.
And B4, carrying out ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result.
And step B5, carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
In the above scheme, conventional DDQN employs uniform random sampling to select experience samples to train the network, enabling faster convergence of the network, but does not take into account the difference in useful information provided by the different experience samples in the experience playback pool.
The application provides a priority sampling mechanism based on cosine similarity and the time difference error, so that the extracted experience samples contain as much useful information as possible, thereby reducing the number of states the network has to explore or exploit and helping the network converge quickly. The first cosine similarity between the selected training experience samples in the first storage area and an unselected training experience sample E_t in the first storage area is defined as:

D_t = sum_{i=1}^{n} (S_i . E_t) / (||S_i|| * ||E_t||)    (2)

Further, the first priority of each experience sample is defined as:

p_t = |delta_t| * (n - D_t) / n    (3)

where S_i is the i-th experience tuple among the extracted experience samples (i.e., the selected training experience samples in the first storage area), n is the total number of extracted experience samples (i.e., the number of selected training experience samples), E_t is the t-th experience sample among the samples to be extracted (i.e., the unselected training experience samples in the first storage area), and delta_t is the TD-error (time difference error) of the experience, i.e., the difference between the reward of the current state and the expected reward of the next state. Sampling is performed sequentially and independently: the sample with the highest priority is drawn each time, the priorities of the remaining samples are updated according to the current sampling result, and this is repeated until enough samples have been extracted for the DDQN network update.
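The sequential priority sampling just described can be sketched as follows. Flattening experience tuples into vectors and seeding the first draw with the largest |TD error| are assumptions, since the patent does not fix these details.

```python
import numpy as np

def priority(selected: np.ndarray, candidate: np.ndarray, td_error: float) -> float:
    """p_t = |delta_t| * (n - D_t) / n, where D_t sums the cosine similarities between
    the candidate E_t and the n already-selected experience vectors S_1..S_n."""
    selected = np.atleast_2d(selected)
    n = selected.shape[0]
    cos = (selected @ candidate) / (
        np.linalg.norm(selected, axis=1) * np.linalg.norm(candidate) + 1e-8)
    return abs(td_error) * (n - cos.sum()) / n

def sample_by_priority(pool, td_errors, k: int):
    """Draw k experiences one at a time, always taking the currently highest-priority
    sample and re-scoring the remaining pool after every draw."""
    vectors = [np.asarray(e, dtype=float) for e in pool]
    first = int(np.argmax(np.abs(td_errors)))          # assumed seeding rule
    chosen = [first]
    remaining = [i for i in range(len(vectors)) if i != first]
    while len(chosen) < k and remaining:
        sel = np.stack([vectors[i] for i in chosen])
        scores = [priority(sel, vectors[i], td_errors[i]) for i in remaining]
        chosen.append(remaining.pop(int(np.argmax(scores))))
    return chosen                                       # indices into the pool
```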
In some embodiments, in step 101, the determining the second priority of each training experience sample stored in the preset second storage area includes:
And C1, determining second cosine similarity between the selected training experience samples in the second storage area and the unselected training experience samples in the second storage area.
And C2, acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to the current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to the next transmission demand state of the current transmission demand state.
And C3, carrying out difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result.
And C4, carrying out ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result.
And step C5, carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
In the above scheme, conventional DDQN employs uniform random sampling to select experience samples to train the network, enabling faster convergence of the network, but does not take into account the difference in useful information provided by the different experience samples in the experience playback pool.
The application provides a priority sampling mechanism based on cosine similarity and the time difference error, so that the extracted experience samples contain as much useful information as possible, thereby reducing the number of states the network has to explore or exploit and helping the network converge quickly. The second cosine similarity between the selected training experience samples in the second storage area and an unselected training experience sample E_t in the second storage area is defined as:

D_t = sum_{i=1}^{n} (S_i . E_t) / (||S_i|| * ||E_t||)

Further, the second priority of each experience sample is defined as:

p_t = |delta_t| * (n - D_t) / n

where S_i is the i-th experience tuple among the extracted experience samples (i.e., the selected training experience samples in the second storage area), n is the total number of extracted experience samples (i.e., the number of selected training experience samples), E_t is the t-th experience sample among the samples to be extracted (i.e., the unselected training experience samples in the second storage area), and delta_t is the TD-error (time difference error) of the experience, i.e., the difference between the reward of the current state and the expected reward of the next state. Sampling is performed sequentially and independently: the sample with the highest priority is drawn each time, the priorities of the remaining samples are updated according to the current sampling result, and this is repeated until enough samples have been extracted for the DDQN network update.
In some embodiments, step 104 comprises:
step 1041, obtaining a real state action value corresponding to the combined target training sample.
Step 1042, inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting the predicted state action value via the pre-built deep reinforcement learning model.
Step 1043, constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
In the scheme, the deep reinforcement learning model updates the weight through a gradient descent method, and reduces the loss between the target Q value (namely the real state action value) and the predicted Q value (namely the predicted state action value) so as to realize the optimal bandwidth allocation decision as much as possible.
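A sketch of one such gradient-descent step in PyTorch is shown below; the network and optimizer interfaces and the use of a mean-squared-error loss are assumptions, with the target Q values assumed to be precomputed as described later.

```python
import torch
import torch.nn.functional as F

def ddqn_step(online_net, optimizer, states, actions, y_target):
    """One update: minimise the loss between the predicted Q(s, a) of the online
    network and the (precomputed) target Q values y_target.
    `actions` is a LongTensor of chosen node indices, shape (batch,)."""
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_pred, y_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```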
Because the DDQN model has learned the environment to different degrees in the early and late stages of the algorithm, the corresponding exploration and exploitation probabilities should also differ: when little environment information has been acquired in the early stage, a larger exploration probability should be used to gather information, while a smaller exploration probability should be used in the later stage to exploit the optimal strategy. In order to balance exploration and exploitation, the application employs an epsilon adaptive adjustment mechanism that uses the reward values obtained from the environment to decide whether to attenuate the exploration rate. Only when a certain reward threshold is crossed, indicating that enough information has been learned from the environment to support better decisions, is the exploration probability reduced and the exploitation probability increased: the value of epsilon is decreased while the reward threshold is increased. In addition, after each state transition, the exploration probability epsilon(s) of each state is recalculated according to the Boltzmann distribution of the value differences.
Here σ is a positive constant that determines the influence of the selected action on the exploration probability of the related state, and δ is the reciprocal of the number of actions of state s, i.e., δ = 1/|A(s)|; Threshold is the reward threshold, and λ_increment is the reward-threshold growth factor, taken as a constant greater than 1.
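A rough sketch of the threshold-gated part of this mechanism follows. The multiplicative epsilon decay factor is an assumption, and the exact per-state epsilon(s) update based on the Boltzmann distribution of value differences is given by the patent's formulas rather than reproduced here.

```python
def adapt_exploration(epsilon_s: float, reward: float, threshold: float,
                      lam_increment: float = 1.1, epsilon_decay: float = 0.9):
    """If the received reward crosses the current reward threshold, reduce the
    exploration probability of this state and raise the threshold; otherwise
    leave both unchanged. epsilon_decay is an assumed constant."""
    if reward >= threshold:
        epsilon_s *= epsilon_decay
        threshold *= lam_increment
    return epsilon_s, threshold
```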
Further, the ε-greedy strategy is expressed as follows:

a_t = a random action from A(s),   if p_e < ε(s)
a_t = argmax_a Q(s_t, a; θ_t),     otherwise

where p_e is a random number between 0 and 1, and ε(s) (0 < ε(s) < 1) is the exploration probability calculated by the algorithm. The strategy randomly selects one action from the action space A(s) with probability ε(s) in order to avoid becoming trapped in a local optimum.
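A per-state ε-greedy selection can be sketched as follows; the flat Q-value vector interface and the NumPy random generator are assumptions.

```python
import numpy as np

def epsilon_greedy_action(q_values: np.ndarray, epsilon_s: float,
                          rng: np.random.Generator) -> int:
    """With probability epsilon(s) pick a random node from A(s); otherwise pick the
    node with the largest Q value. q_values[a] is Q(s, a) over the action space."""
    if rng.random() < epsilon_s:                 # explore
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))              # exploit
```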
By obtaining an adaptive bandwidth allocation decision scheme for the satellite-terrestrial converged network, bandwidth allocation decisions can be made in real time according to the current network state and the predicted bandwidth demand. Efficient network bandwidth resource allocation is thus achieved for the satellite-terrestrial converged network, the personalized requirements of various services are met, and the performance of the network is improved.
In some embodiments, the real-time network environment status information includes: current star network state parameters.
In step 105, the determining, based on the real-time network environment status information, status action values obtained by selecting each star network node to execute the communication service by using the trained deep reinforcement learning model includes:
Step 1051, obtaining the instant prize value and discount factor obtained by executing any star network node under the current star network state parameter, and selecting the current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter.
And step 1052, performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result.
And 1053, adding the second product processing result and the instant rewards value to obtain a state action value obtained by selecting each star network node to execute the communication service.
In the above scheme, the deep reinforcement learning model reduces overestimation by decomposing the max operation in the target into action selection and action evaluation. Meanwhile, the calculation of the target Q value in the deep Q-network is further improved, so that the Q value (i.e., the state action value) is more accurate.
The target Q value (i.e., the state action value) is finally expressed as:

Y_t = r_{t+1} + γ * Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ_t^-)    (4)

where Y_t is the target Q value; r_{t+1} is the instant reward value designed for the satellite-terrestrial converged network, representing the instant reward obtained after performing the satellite network node selection action a_t under the current satellite network state parameter s_t; and γ is the discount factor that weighs the importance of future reward values. θ_t denotes the current Q-network parameters, i.e., the network weights at the current time t, and θ_t^- denotes the target Q-network parameters used to calculate the target Q value. argmax_a Q(s_{t+1}, a; θ_t) denotes the action a_t with the highest Q value (the current maximum state action value) given the satellite-terrestrial converged network state parameter s_{t+1}, and Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ_t^-) is the current state action value obtained by selecting the satellite network node corresponding to the current maximum state action value under the current satellite network state parameter.
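A PyTorch sketch of this double-DQN target computation is given below; the network objects, batch shapes and the 0.99 discount default are assumptions.

```python
import torch

@torch.no_grad()
def ddqn_target(rewards, next_states, online_net, target_net, gamma: float = 0.99):
    """Y_t = r_{t+1} + gamma * Q(s_{t+1}, argmax_a Q(s_{t+1}, a; theta_t); theta_t^-):
    the online network selects the action, the target network evaluates it."""
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # from theta_t
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # from theta_t^-
    return rewards + gamma * next_q
```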
In some embodiments, as shown in fig. 2, in S1 an action is selected and performed according to the current state (i.e., a target satellite network node is selected to execute the communication service), the state at the next time (i.e., the next transmission demand state of the current transmission demand state) and a reward (i.e., a reward value) are obtained, and the experience is stored into the corresponding buffer according to the reward.
For example, in S1, a satellite network node selection action is selected and executed according to the current state, the state at the next time is obtained, and the general reward (i.e., the reward value) is calculated according to formula (1). Quadruples (s, a, r, s') whose reward value is greater than or equal to 0.5 (i.e., the preset reward value threshold) are stored in the successful playback buffer (i.e., the first storage area), and quadruples whose reward value is less than 0.5 are stored in the failed playback buffer (i.e., the third storage area), thereby obtaining a successful experience matrix composed of the stored quadruples.
Formula (1) is expressed as follows:

r_t(s_t, a_t) = b_{n,a_n} / (M * b_max)    (1)
In S2, the LSTM (long short-term memory network model) is trained using the successful experiences (i.e., the training experience samples stored in the first storage area); the LSTM is then used to process the failed experiences (i.e., the training experience samples stored in the third storage area), and the processed experiences (i.e., the predicted experience samples) are saved.
For example, in S2, when the number of experiences in the successful experience pool meets the required amount, the LSTM is trained using the successful experiences, and the failed network bandwidth decision experiences are then processed by the LSTM to obtain a predicted experience matrix (i.e., the predicted experience samples).
S3, collecting a mixed experience set (i.e., the combined target training sample) from the different experience pools, and training the DDQN network (i.e., the deep reinforcement learning model).
For example, S3: the priorities of the samples in the success experience pool (i.e., the first storage area) and the prediction experience pool (i.e., the second storage area) are calculated using formulas (2) and (3), and the high-priority experience samples are preferentially extracted until 500 playback experience samples have been extracted, resulting in a mixed experience matrix (i.e., the combined target training sample).
Wherein, formula (2) (the first priority, for the success experience pool) is expressed as follows:

$$p_1 = \frac{k - \mathrm{sim}_1}{k}\,\lvert\delta\rvert$$

Formula (3) (the second priority, for the prediction experience pool) is expressed as follows:

$$p_2 = \frac{k - \mathrm{sim}_2}{k}\,\lvert\delta\rvert$$

where k is the number of selected training experience samples, sim_1 and sim_2 are the first and second cosine similarities between the selected and unselected samples in the respective storage areas, and δ is the time difference error.
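The bodies of formulas (2) and (3) are reconstructed above from the priority definition used elsewhere in this application; the sketch below implements that reconstruction and should be read as an assumption about their precise form rather than a verbatim transcription.

```python
import numpy as np

def sample_priority(candidate, selected_samples, td_error):
    """Priority in the spirit of formulas (2)/(3):
    ((k - cos_sim) / k) * |TD error|, where k is the number of samples already
    selected from the pool and cos_sim is the cosine similarity between the
    candidate (an unselected sample) and the already-selected samples."""
    k = len(selected_samples)
    if k == 0:
        return abs(td_error)                       # nothing selected yet
    cand = np.asarray(candidate, dtype=float)
    sims = []
    for s in selected_samples:
        s = np.asarray(s, dtype=float)
        sims.append(np.dot(cand, s) /
                    (np.linalg.norm(cand) * np.linalg.norm(s) + 1e-8))
    cos_sim = float(np.mean(sims))                 # averaged over the selected set
    return (k - cos_sim) / k * abs(td_error)
```

Samples from the success pool and the prediction pool are scored in this way, and the highest-priority ones are taken until the 500-sample mixed experience matrix is filled.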
The DDQN network is trained using the mixed experience set and the latest experience set (i.e., the training experience samples corresponding to the current network environment state), and the ε-greedy strategy of formulas (5)-(8) is adopted to reduce the loss between the target Q value (i.e., the real state action value) and the predicted Q value (i.e., the predicted state action value), where σ is set to 1, the λ increment is set to 1.1, the initial threshold is set to 0.75, and the initial ε of every state is set to 0.5 (see the training sketch after formulas (4)-(8) below). The target Q value is finally obtained through formula (4) and is used to evaluate and select the action with the highest Q value (i.e., the star network node corresponding to the maximum state action value executes the communication service). Action selection is performed continuously according to the target Q value, and the network bandwidth resources are finally allocated reasonably, thereby optimizing the performance of the star-ground fusion network.
Wherein, formula (4) is the target Q value expression given above:

$$Y_t = r_{t+1} + \gamma\, Q\!\left(s_{t+1},\ \arg\max_{a} Q(s_{t+1}, a;\ \theta_t);\ \theta_t^{-}\right)$$
Formula (5) is expressed as follows:
Formula (6) is expressed as follows:
Formula (7) is expressed as follows:
Formula (8) is expressed as follows:
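Since the expressions of formulas (5)-(8) (the adaptive ε-greedy rules parameterized by σ = 1, the λ increment of 1.1, the threshold of 0.75, and the initial ε of 0.5) are not reproduced here, the sketch below substitutes a plain ε-greedy policy with the stated initial ε and minimizes a mean-squared loss between the predicted Q value and the target Q value of formula (4). ddqn_target refers to the earlier sketch; the optimizer handling and function names are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def select_action(online_net, state, num_nodes, epsilon=0.5):
    """ε-greedy selection: explore with probability ε, otherwise pick the
    star network node with the largest predicted state action value."""
    if random.random() < epsilon:
        return random.randrange(num_nodes)
    with torch.no_grad():
        return int(online_net(state.unsqueeze(0)).argmax(dim=1).item())

def train_step(online_net, target_net, optimizer, batch, gamma=0.9):
    """One DDQN update on a mini-batch drawn from the mixed and latest
    experience sets: minimize the loss between predicted and target Q values."""
    state, action, reward, next_state = batch            # batched tensors
    predicted_q = online_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    target_q = ddqn_target(reward, next_state, online_net, target_net, gamma)
    loss = F.mse_loss(predicted_q, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())
```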
It should be noted that the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiments may also be applied in a distributed scenario and completed by a plurality of devices cooperating with each other. In such a distributed scenario, one of the devices may perform only one or more steps of the method, and the devices interact with each other to accomplish the method.
It should be noted that the foregoing describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the application also provides a device for distributing bandwidth resources of a satellite communication network, which corresponds to the method of any embodiment.
Referring to fig. 3, the allocation apparatus of bandwidth resources of a satellite communication network includes:
A priority determining module 301 configured to determine a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, where the training experience samples represent execution parameters of a communication service;
A sample selection module 302 configured to select training experience samples from among the training experience samples of the first storage area according to the first priority, and to select training experience samples from among the training experience samples of the second storage area according to the second priority;
A sample combination module 303, configured to obtain training experience samples corresponding to the current network environment state, and combine the training experience samples selected in the first storage area, the training experience samples selected in the second storage area, and the training experience samples corresponding to the current network environment state to obtain a combined target training sample;
A training module 304 configured to train the pre-constructed deep reinforcement learning model by using the combined target training sample to obtain a trained deep reinforcement learning model;
A value determining module 305, configured to obtain real-time network environment state information, and based on the real-time network environment state information, determine state action values obtained by selecting each star network node to execute a communication service by using the trained deep reinforcement learning model, respectively;
The resource allocation module 306 is configured to determine a maximum state action value of the state action values corresponding to the respective star network nodes, obtain a target network bandwidth occupation value when the star network node corresponding to the maximum state action value executes the communication service, and allocate the satellite communication network bandwidth resource to the star network node corresponding to the maximum state action value according to the target network bandwidth occupation value.
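A sketch of how the value determining module 305 and the resource allocation module 306 could be chained at inference time; the accessor bandwidth_occupancy_of (returning the target network bandwidth occupation value of a node) is a placeholder, not an interface defined by the application.

```python
import torch

def allocate_bandwidth(trained_net, state, bandwidth_occupancy_of):
    """Select the star network node with the maximum state action value and
    allocate satellite network bandwidth according to its target network
    bandwidth occupation value."""
    with torch.no_grad():
        q_values = trained_net(state.unsqueeze(0)).squeeze(0)  # one Q value per node
    best_node = int(q_values.argmax().item())
    target_occupancy = bandwidth_occupancy_of(best_node)       # placeholder accessor
    return best_node, target_occupancy
```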
In some embodiments, the allocation apparatus of satellite communication network bandwidth resources further comprises a storage module, which specifically includes:
a first acquisition unit configured to acquire a current transmission demand state of any one of a plurality of user terminals;
A second obtaining unit configured to select a target star network node from a plurality of star network nodes based on the current transmission demand state to perform a communication service, and obtain a network bandwidth limit value of a transmission channel between the any one user terminal and the target star network node, a network bandwidth occupation value when the communication service is performed, and a next transmission demand state of the current transmission demand state;
A reward value determining unit configured to determine a reward value using the network bandwidth limit value and the network bandwidth occupation value;
A first storage unit configured to store the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value, and the next transmission demand state to the first storage area or a preset third storage area according to the reward value;
A training unit configured to train a pre-constructed long-term and short-term memory network model by using the training experience samples stored in the first storage area to obtain a trained long-term and short-term memory network model;
And the second storage unit is configured to input the training experience sample stored in the third storage area into a trained long-term and short-term memory network model for prediction to obtain a prediction experience sample, and store the prediction experience sample into the second storage area.
In some embodiments, the reward value determining unit is specifically configured to:
Acquiring the number of the star network nodes;
Performing product processing by using the number of the star network nodes and the network bandwidth limit value to obtain a first product processing result;
and carrying out ratio processing on the network bandwidth occupation value and the first product processing result to obtain the reward value.
In some embodiments, the first storage unit is specifically configured to:
Judging whether the reward value is greater than or equal to a preset reward value threshold, and obtaining a judgment result;
In response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area; or alternatively
In response to the judgment result being no, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the third storage area.
In some embodiments, the priority determination module 301 is specifically configured to determine the first priority of each training experience sample stored in the first storage area by:
determining a first cosine similarity between the selected training experience samples in the first storage area and the unselected training experience samples in the first storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result;
Performing ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result;
And carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
In some embodiments, the priority determination module 301 is further specifically configured to determine the second priority of each training experience sample stored in the second storage area by:
determining a second cosine similarity between the selected training experience samples in the second storage area and the unselected training experience samples in the second storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result;
performing ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result;
And carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
In some embodiments, training module 304 is specifically configured to:
acquiring a real state action value corresponding to the combined target training sample;
inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting a predicted state action value through the pre-built deep reinforcement learning model;
And constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
In some embodiments, the real-time network environment status information includes: current star network state parameters;
The value determination module 305 is specifically configured to:
Acquiring an instant rewarding value and a discount factor obtained by executing any star network node under the current star network state parameter, and selecting a current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter;
performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result;
And adding the second product processing result and the instant rewarding value to obtain a state action value obtained by selecting each star network node to execute communication service.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The device of the foregoing embodiment is configured to implement the corresponding method for allocating bandwidth resources of a satellite communication network in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the application also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method for allocating bandwidth resources of the satellite communication network according to any embodiment when executing the program.
Fig. 4 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 401, a memory 402, an input/output interface 403, a communication interface 404, and a bus 405. Wherein the processor 401, the memory 402, the input/output interface 403 and the communication interface 404 are in communication connection with each other inside the device via a bus 405.
The processor 401 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, etc., for executing related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 402 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage, dynamic storage, etc. The memory 402 may store an operating system and other application programs; when the solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 402 and invoked for execution by the processor 401.
The input/output interface 403 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 404 is used to connect a communication module (not shown in the figure) to enable communication interaction between the present device and other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 405 includes a path to transfer information between components of the device (e.g., processor 401, memory 402, input/output interface 403, and communication interface 404).
It should be noted that, although the above device only shows the processor 401, the memory 402, the input/output interface 403, the communication interface 404, and the bus 405, in the implementation, the device may further include other components necessary for realizing normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the method for allocating bandwidth resources of a satellite communication network according to any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the present application also provides a non-transitory computer readable storage medium corresponding to the method of any embodiment, wherein the non-transitory computer readable storage medium stores computer instructions for causing the computer to execute the method for allocating bandwidth resources of a satellite communication network according to any embodiment.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to execute the method for allocating bandwidth resources of a satellite communication network according to any one of the foregoing embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the application as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory architectures (e.g., dynamic RAM (DRAM)).
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like, which are within the spirit and principles of the embodiments of the application, are intended to be included within the scope of the application.

Claims (10)

1. A method for allocating bandwidth resources of a satellite communication network, comprising:
Determining a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
selecting training experience samples from the training experience samples in the first storage area according to the first priority, and selecting training experience samples from the training experience samples in the second storage area according to the second priority;
acquiring an experience sample for training corresponding to the current network environment state, and combining the experience sample for training selected in the first storage area, the experience sample for training selected in the second storage area and the experience sample for training corresponding to the current network environment state to obtain a sample for combined target training;
Training a pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
acquiring real-time network environment state information, and respectively determining state action values obtained by selecting each star network node to execute communication service by using the trained deep reinforcement learning model based on the real-time network environment state information;
Determining the maximum state action value in the state action values corresponding to all the star network nodes, acquiring a target network bandwidth occupation value when the star network node corresponding to the maximum state action value executes communication service, and distributing satellite communication network bandwidth resources to the star network node corresponding to the maximum state action value according to the target network bandwidth occupation value.
2. The method of claim 1, wherein prior to determining the first priority of each training experience sample stored in the preset first storage area and the second priority of each training experience sample stored in the preset second storage area, the method further comprises:
acquiring the current transmission demand state of any one of a plurality of user terminals;
Selecting a target star network node from a plurality of star network nodes to execute communication service based on the current transmission demand state, and acquiring a network bandwidth limit value of a transmission channel between any user terminal and the target star network node, a network bandwidth occupation value when executing the communication service and a next transmission demand state of the current transmission demand state;
Determining a reward value using the network bandwidth limit value and the network bandwidth occupation value;
Storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state to the first storage area or a preset third storage area according to the reward value;
Training a pre-constructed long-term and short-term memory network model by using the training experience sample stored in the first storage area to obtain a trained long-term and short-term memory network model;
And inputting the training experience sample stored in the third storage area into a trained long-term and short-term memory network model for prediction to obtain a prediction experience sample, and storing the prediction experience sample into the second storage area.
3. The method of claim 2, wherein said determining a reward value using said network bandwidth limit value and said network bandwidth occupation value comprises:
Acquiring the number of the star network nodes;
Performing product processing by using the number of the star network nodes and the network bandwidth limit value to obtain a first product processing result;
and carrying out ratio processing on the network bandwidth occupation value and the first product processing result to obtain the reward value.
4. The method according to claim 2, wherein the storing, according to the reward value, the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value, and the next transmission demand state into the first storage area or a preset third storage area comprises:
Judging whether the reward value is greater than or equal to a preset reward value threshold, and obtaining a judgment result;
In response to the judgment result being yes, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the first storage area; or alternatively
In response to the judgment result being no, storing the training experience sample formed by combining the current transmission demand state, the target star network node, the reward value and the next transmission demand state into the third storage area.
5. The method of claim 1, wherein determining the first priority of each training experience sample stored in the preset first storage area comprises:
determining a first cosine similarity between the selected training experience samples in the first storage area and the unselected training experience samples in the first storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the first cosine similarity to obtain a first difference processing result;
Performing ratio processing by using the first difference value processing result and the number of the selected training experience samples to obtain a first ratio processing result;
And carrying out product processing on the first ratio processing result and the absolute value of the time difference error to obtain the first priority.
6. The method of claim 1, wherein determining the second priority of each training experience sample stored in the preset second storage area comprises:
Determining a second cosine similarity between the selected training experience samples in the second storage area and the unselected training experience samples in the second storage area;
Acquiring the number of the selected training experience samples and a time difference error, wherein the time difference error is a difference value between a reward value corresponding to a current transmission demand state of any one of a plurality of user terminals and a reward value corresponding to a next transmission demand state of the current transmission demand state;
performing difference processing on the number of the selected training experience samples and the second cosine similarity to obtain a second difference processing result;
performing ratio processing by using the second difference value processing result and the number of the selected training experience samples to obtain a second ratio processing result;
And carrying out product processing on the second ratio processing result and the absolute value of the time difference error to obtain the second priority.
7. The method of claim 1, wherein training the pre-constructed deep reinforcement learning model using the combined target training sample to obtain a trained deep reinforcement learning model comprises:
acquiring a real state action value corresponding to the combined target training sample;
inputting the combined target training sample into the pre-built deep reinforcement learning model, and outputting a predicted state action value through the pre-built deep reinforcement learning model;
And constructing a loss function based on the real state action value and the predicted state action value, performing minimization treatment on the loss function by using an epsilon-greedy strategy, and performing training adjustment on the pre-constructed deep reinforcement learning model according to the minimization treatment result to obtain a trained deep reinforcement learning model.
8. The method of claim 1, wherein the real-time network environment status information comprises: current star network state parameters;
Based on the real-time network environment state information, the state action value obtained by selecting each star network node to execute the communication service is respectively determined by utilizing the trained deep reinforcement learning model, and the method comprises the following steps:
Acquiring an instant rewarding value and a discount factor obtained by executing any star network node under the current star network state parameter, and selecting a current state action value obtained by the star network node corresponding to the current maximum state action value under the current star network state parameter;
performing product processing by using the discount factor and the action value of the current state to obtain a second product processing result;
And adding the second product processing result and the instant rewarding value to obtain a state action value obtained by selecting each star network node to execute communication service.
9. An apparatus for allocating bandwidth resources of a satellite communication network, comprising:
A priority determining module configured to determine a first priority of each training experience sample stored in a preset first storage area and a second priority of each training experience sample stored in a preset second storage area, wherein the training experience samples represent execution parameters of communication services;
A sample selection module configured to select training experience samples from among the training experience samples of the first storage area according to the first priority, and to select training experience samples from among the training experience samples of the second storage area according to the second priority;
The sample combination module is configured to acquire training experience samples corresponding to the current network environment state, and combine the training experience samples selected in the first storage area, the training experience samples selected in the second storage area and the training experience samples corresponding to the current network environment state to acquire a combined target training sample;
The training module is configured to train the pre-constructed deep reinforcement learning model by utilizing the combined target training sample to obtain a trained deep reinforcement learning model;
The value determining module is configured to acquire real-time network environment state information, and based on the real-time network environment state information, respectively determine state action values obtained by selecting each star network node to execute communication service by utilizing the trained deep reinforcement learning model;
The resource allocation module is configured to determine the largest state action value in the state action values corresponding to the star network nodes, acquire a target network bandwidth occupation value when the star network node corresponding to the largest state action value executes communication service, and allocate satellite communication network bandwidth resources to the star network node corresponding to the largest state action value according to the target network bandwidth occupation value.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when the program is executed by the processor.
CN202410182515.1A 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment Pending CN118282471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410182515.1A CN118282471A (en) 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410182515.1A CN118282471A (en) 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment

Publications (1)

Publication Number Publication Date
CN118282471A true CN118282471A (en) 2024-07-02

Family

ID=91637596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410182515.1A Pending CN118282471A (en) 2024-02-19 2024-02-19 Method for distributing bandwidth resources of satellite communication network and related equipment

Country Status (1)

Country Link
CN (1) CN118282471A (en)


Legal Events

Date Code Title Description
PB01 Publication